TASK 1 OMVPE GROWTH OF III-V ALLOYS FOR NEW HIGH SPEED
ELECTRON DEVICES

J.  R.  Shealy 


Arsine  flow  requirement  for  the  flow  modulation  growth  of  high  purity 
GaAs  using  adduct-grade  triethylgallium 

B. L. Pitts, D. T. Emerson, and J. R. Shealy

OMVPE Facility, School of Electrical Engineering, Cornell University, Ithaca, New York 14853
(Received  1  May  1992;  accepted  for  publication  14  August  1992) 

Using  arsine  and  triethylgallium  with  flow  modulation,  organometallic  vapor  phase  epitaxy  can 
produce high purity GaAs layers with V/III molar ratios near unity. We have estimated that
under  appropriate  growth  conditions  the  arsine  incorporation  efficiency  into  epitaxial  GaAs  can 
exceed  30%.  The  arsine  flow  requirement  for  obtaining  good  morphology  has  been  identified 
over  a  range  of  substrate  temperatures  using  adduct-grade  triethylgallium.  The  process 
described  reduces  the  environmental  impact  and  life  safety  risk  of  the  hydride  based 
organometallic  vapor  phase  epitaxial  method. 


Organometallic vapor phase epitaxy (OMVPE) has demonstrated the ability to produce a variety
of device quality III-V compounds and structures. With a carefully designed gas flow switching
apparatus, interface abruptness approaching a perfect compositional change across a single
atomic layer has been realized. Optimized results are often achieved using reduced growth
pressures. It has been suggested that growing at reduced pressures often results in sharper
interfaces, reduced autodoping, and lower growth rates which increase the accuracy of layer
control.1 Furthermore, in many reactor cell designs (e.g., vertical barrel) reduced pressure is
required to eliminate gas recirculation due to convection forces. One of the disadvantages in
growing high purity III-V compound semiconductors by low pressure OMVPE is the increased flow
requirement of highly toxic hydrides (e.g., arsine, phosphine). In conventional reduced
pressure OMVPE using trimethylgallium (TMG) and arsine (AsH3), high molar V/III ratios are
necessary to obtain high purity GaAs.2 Efforts have been made to reduce AsH3 consumption,
including precracking3 of the arsine and substituting triethylgallium (TEG) for TMG.4-7 None
of these methods have resulted in device quality material with V/III ratios near unity. Less
toxic group V liquid sources are presently available which, at V/III ratios of 10 or greater,
yield 77 K mobilities greater than 100 000 cm^2/V s.8 Low pressure OMVPE growth is still done
at relatively high V/III ratios. This leads to potential safety hazards due to the expulsion
of the excess arsine that does not participate in the growth process, and to the increased
handling of the source containers. Efforts to minimize high pressure cylinder storage include
an on-demand arsine gas generator, but a low 77 K mobility was observed (76 000 cm^2/V s).9
In this study using flow modulation epitaxy (FME),10 we demonstrate a process which does not
require excess AsH3 and which produces high quality GaAs epitaxial layers (77 K mobilities of
90 000 cm^2/V s). The process described allows for small quantities of arsine storage in the
facility and could be used in conjunction with hydride generator technologies to minimize the
safety issues involved in OMVPE growth of many III-V compounds.

The use of TEG and AsH3 has been proven to give lower background carbon concentrations in
GaAs than the widely used TMG.4-7 The first high purity GaAs result by OMVPE, by Seki et al.,*
used TEG and AsH3 at a V/III ratio of 2 and reported a 77 K mobility of 120 000 cm^2/V s. At
reduced pressures, the highest purity GaAs was grown at a V/III ratio of 17.3, resulting in a
77 K mobility of 190 000 cm^2/V s on approximately 10 μm films.’ High purity GaAs has been
produced at a V/III ratio of 8 (Ref. 3) whereas in this study, significantly lower V/III
ratios result in similar quality films. The reduction of the V/III ratio is attributed to the
use of flow modulation. Low V/III ratios (V/III = 5-20) are also used in metalorganic
molecular beam epitaxy,11 but the best results are p-type and have carbon concentrations
exceeding mid 10^14 cm^-3.

An investigation of high purity GaAs grown by low pressure OMVPE with flow modulation and
with V/III ratios approaching unity is reported. A V/III ratio of 1.8 resulted in a film with
a 77 K mobility exceeding 90 000 cm^2/V s and a room-temperature mobility exceeding 8000
cm^2/V s. Comparable results are observed with a V/III ratio of unity provided substrate
temperatures greater than 610 °C are used. Finally, the AsH3 flow requirement for this process
has been identified and determined to be a strong function of substrate temperature if high
quality surfaces are to be obtained. All films which were observed to have mirrorlike surfaces
are of high purity as inferred from low-temperature Hall and photoluminescence data. Growths
carried out with subunity V/III ratios were characterized by poor surfaces and reduced growth
rates, indicative of arsenic diffusion limited growth.

GaAs layers were grown using FME at low pressure (76 Torr) in a vertical barrel multichamber
OMVPE system.12 In this system substrates are rotated through spatially separated group III
rich and group V rich zones without valve switching. During the group III exposure cycles the
local V/III ratio is estimated to be 23% of the average value. The substrate then enters a
group V exposure cycle. The V/III ratio quoted throughout represents the average value
determined by the total injected reactant fluxes. The susceptor was rotated at 0.1 rev/s and
the growth rate was 8 monolayers/cycle (1 μm/h).
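
As a rough cross-check of the quoted growth rate, the sketch below converts monolayers per cycle into micrometers per hour. It assumes one exposure cycle per susceptor revolution and a (100) GaAs monolayer thickness of half the lattice constant (taken here as 0.5653 nm). Neither assumption is stated in the text, and the function name is purely illustrative.

```python
GAAS_LATTICE_NM = 0.5653              # GaAs lattice constant at room temperature (assumed)
MONOLAYER_NM = GAAS_LATTICE_NM / 2.0  # one Ga-As bilayer along [100]

def growth_rate_um_per_h(monolayers_per_cycle, cycles_per_s):
    """Convert a per-cycle growth increment into micrometers per hour."""
    nm_per_s = monolayers_per_cycle * MONOLAYER_NM * cycles_per_s
    return nm_per_s * 3600.0 / 1000.0

# 8 monolayers/cycle at 0.1 rev/s, assuming one cycle per revolution
rate = growth_rate_um_per_h(8, 0.1)
print(f"{rate:.2f} um/h")
```

Under these assumptions the result is about 0.8 μm/h, in line with the rounded figure of 1 μm/h quoted above.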

Undoped epitaxial layers were grown using adduct-purified TEG20 and AsH3 (100%). Layer
thicknesses ranged from 3 to 6 μm. The substrates, (100) Si-doped n+ GaAs and (100)
semi-insulating GaAs, were first rinsed in organic solvents and then etched in
5H2SO4:1H2O2:1H2O prior to growth. The TEG was held at 23 °C while a H2 flow of 50 sccm was
passed through the bubbler, maintained at 100 Torr. The growth temperature ranged from 560 to
635 °C, while the V/III ratio varied from 0.7 to 22. Growth rate measurements were performed
using angle lapping and staining. Thickness uniformity was ±1.2% across a 1.5 in. diam wafer.
Carrier concentrations and mobilities were measured using the van der Pauw method in a
magnetic field of 3.5 kG at both 300 and 77 K. Low-temperature (1-20 K) photoluminescence (PL)
was used to investigate the excitonic features as well as to identify the acceptor impurities.

FIG. 1. The effects of V/III ratio and substrate temperature on surface morphology of GaAs
using adduct-purified TEG. The right-hand vertical axis provides estimates of the arsine
growth efficiency, which depends on the V/III ratio. The arsine growth efficiency is defined
as the ratio of the arsenic incorporated in the GaAs deposited on the active portion of the
susceptor to the arsine in the gas stream. The solid line was drawn empirically to suggest the
transition between the good and bad morphology regions. Inset is a SEM micrograph (magnified
200×) illustrating what is meant by poor morphology.

FIG. 2. Hall measurement results showing dependence of (a) net (Nd - Na) and total (Nd + Na)
impurity concentration and (b) 77 K mobility on V/III ratio in undoped GaAs grown at 635 °C.
The lines are drawn empirically to suggest a trend in the data.

Arsine efficiency was calculated for the reactor and is defined as the ratio of the amount of
As incorporated in the GaAs on the entire active portion of the susceptor to the AsH3 supplied
in the gas stream. The maximum possible efficiency (V/III = 1) is 31%, where high quality
films are observed using low-temperature PL. Conducting films are obtained at higher V/III
ratios (μ77K = 93 000 cm^2/V s and μ300K = 8000 cm^2/V s) where the AsH3 efficiency is
calculated to be 17.2% (V/III = 1.8). Previous studies in this reactor using TMG found that
the p-n transition occurred around V/III = 30 at the same growth pressure, growth rate, and
flow modulation, where a V/III = 70 resulted in a 77 K mobility of 96 000 cm^2/V s.12 For
V/III ratios less than 30, the films suffered from increasing levels of carbon contamination.
Thus, a 20-fold improvement in the efficiency of AsH3 resulted using adduct-purified TEG in
place of TMG. There is also no p-n transition with a decrease in the V/III ratio using
adduct-purified TEG.
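
The two efficiency figures quoted above are mutually consistent if the efficiency scales inversely with the average V/III ratio, which is what one would expect when the growth rate is set by the gallium flux. The sketch below illustrates that scaling. The inverse-proportional model and the function name are assumptions made here, not a calculation taken from the paper.

```python
MAX_EFFICIENCY = 0.31  # efficiency at V/III = 1, as quoted in the text

def arsine_efficiency(v_iii_ratio):
    """Estimated fraction of supplied AsH3 that ends up as As in the film.

    Assumes Ga-limited growth, so the incorporated As is fixed and the
    efficiency falls as 1/(V/III). Only meaningful for V/III >= 1.
    """
    if v_iii_ratio < 1.0:
        raise ValueError("model assumes V/III >= 1 (Ga-limited growth)")
    return MAX_EFFICIENCY / v_iii_ratio

for ratio in (1.0, 1.8, 5.0, 22.0):
    print(f"V/III = {ratio:4.1f}: efficiency = {100 * arsine_efficiency(ratio):.1f}%")
```

At V/III = 1.8 this simple model gives about 17.2%, matching the value reported for the conducting films.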

The V/III ratio and growth temperature criteria for good surface morphology were investigated
over the range from 560 to 635 °C. In Fig. 1, a good morphology/bad morphology transition
curve is shown to suggest the minimum V/III ratio required at a given growth temperature. The
relationship between V/III ratio and AsH3 efficiency is also provided in the figure. When the
substrate temperature is below 610 °C, more arsine must be supplied as shown, indicating that
less AsH3 is being pyrolyzed. As the substrate temperature is increased beyond 610 °C, the
transition from good morphology to poor morphology approaches a constant V/III value of unity.
This suggests that the AsH3 arriving at the growth surface is completely pyrolyzed, and
maximum AsH3 efficiency can be achieved when the growth temperature exceeds 610 °C.

The impurity concentration and low temperature (77 K) mobility for samples grown at 635 °C
with V/III ratios from 1.8 to 22 are given in Fig. 2. Total impurity concentration (Nd + Na)
was estimated using the empirical relation given by Stillman and Wolfe. 5 Net impurity
concentration varied from 3.7×10^14 to 6.9×10^14 cm^-3 while Nd + Na varied from 7.7×10^14 to
2.0×10^15 cm^-3. The minimum value in each case was obtained for a V/III ratio of 5, where
residual carbon levels are in the low 10^14 cm^-3 range. The 77 K mobility varied from 55 000
to 110 000 cm^2/V s, with the maximum value obtained for a V/III ratio of 5. As indicated in
Fig. 2, the impurity concentration is, in general, an increasing function of V/III ratio. The
highest room-temperature mobility was greater than 8000 cm^2/V s. High purity results are
readily observed with PL at a V/III ratio of unity for growth temperatures greater than
630 °C, but these films did not exhibit electrical conduction due to increased acceptor
compensation. Subunity growth had Ga rich surfaces which prevented contact formation.

FIG. 3. Low-temperature (1 K) PL spectra of undoped GaAs layers grown at 635 °C. The
luminescence intensity is magnified by the factors shown, and the average V/III ratios are
given in parentheses at the far right of the figure. Excitation conditions and experimental
resolution are as indicated.

Low-temperature PL spectra of samples grown at 635 °C are shown in Fig. 3. The dominant
feature in the excitonic region of the spectra is that of the neutral donor exciton (D°,X).
The neutral acceptor exciton peak (A°,X) is negligible in samples grown with V/III ratios
greater than 1.8, strongly indicating n-type material.2 This is consistent with Hall
measurement results. Two-acceptor behavior in the PL spectra was observed using variable
temperature measurements (1-20 K) for V/III ratios less than 1.8. Although not pronounced in
the 1 K spectra, a signature of band-to-acceptor transitions of carbon (~1.493 eV) is evident
in spectra observed at 12 K for V/III ratios less than 1.8. The peaks which appear in the
spectra for V/III ratios of 1.8 and unity at 1.492 eV behave as band-to-acceptor transitions
with the acceptors tentatively identified as magnesium. The corresponding donor-acceptor pair
luminescence observed at 1.489 eV supports this assignment of acceptor species which is
believed to originate from the arsine source.’ The absence of carbon acceptor related
luminescence for V/III ratios greater than 1.8 suggests that sufficient AsH3 is present to
remove the carbon from the growth surface. As the V/III ratio approaches unity, the magnitude
of the neutral acceptor exciton peak is comparable to that of the neutral donor exciton peak.
Finally, when the V/III ratio is reduced to subunity (0.7), the normal excitonic features are
completely absent from the spectra. New spectral features appear much weaker in intensity,
possibly due to defect related exciton emission at photon energies near 1.503 eV,15 commonly
observed in molecular beam epitaxy materials.

In summary, by applying flow modulation techniques, highly efficient use of AsH3 has been
demonstrated with an optimized low pressure OMVPE process for the first time. It has been
observed that below 610 °C, more arsine must be supplied to sustain good morphology. Above
610 °C, maximum AsH3 efficiency (V/III = 1) can be obtained while maintaining specular
surfaces. In addition, there was no p-n transition region in the range studied. Using
adduct-purified TEG and AsH3 in OMVPE at reduced pressure, we have demonstrated a near-unity
V/III ratio resulting in a 77 K mobility exceeding 90 000 cm^2/V s.


The Use of Ultraviolet Radiation at the Congruent
Sublimation Temperature of Indium Phosphide to Produce
Enhanced InP "Schottky" Barriers

James Singletery, Jr. and James R. Shealy

ABSTRACT

This paper describes an ultraviolet radiation-assisted process, optimized around the congruent
sublimation temperature of InP, which fabricates a very thin insulating layer on InP. In
developing this process, we demonstrate, among other effects, that the increase in the barrier
height is not caused by the oxidation of the surface enhanced by the presence of ozone, but
enhanced by a photoinduced electron transfer (PET) process. In the past, some researchers have
considered similar devices to be enhanced metal-semiconductor Schottky diodes. Although we
achieved a barrier height of 0.7 V, we present measurements of series resistance and ideality
factors which question the Schottky character of these devices. Furthermore, the dramatic
increase in series resistance, as the barrier increases, suggests that the gate speed for
microwave devices fabricated with this technology may be less than expected because of a
larger than expected resistance-capacitance time constant. The instability of these devices,
when exposed to air, suggests that among the oxides which make up the enhanced layer, P2O5 is
the primary material responsible for enhancement.


A comparison of the basic transport properties of GaAs and InP yields an advantage to InP in
peak and saturation velocities,1 breakdown field, and thermal conductivity.2 These benefits
have led to encouraging device results in higher power,3 faster speed,4 lower noise,5 and
increased radiation hardness.6 However, the low Schottky barrier formed at
metal-semiconductor (MES) interfaces generates large leakage currents that eventually degrade
the speed, power, and gain of MES devices. To eliminate this problem, researchers typically
use a metal-insulator-semiconductor (MIS) structure using SiO2 as the insulator. But others
have demonstrated the instabilities of the SiO2/InP interface under dc operating conditions.7
This paper describes an ultraviolet (UV) radiation-assisted process, optimized around the
congruent sublimation temperature of InP, which produces Schottky barriers up to 0.7 V. Based
on series resistance and ideality factor measurements, this paper also concludes that these
devices exhibit behavior more like MIS structures with a very thin insulating layer rather
than Schottky diodes. In addition, the increase in series resistance, as the barrier height
increases, suggests that the gate speed of microwave devices fabricated with this technology
will be lower than expected due to a larger than expected resistance-capacitance (RC) time
constant.

Background 

Of the many researchers that have used UV-assisted growth to enhance InP's Schottky barrier,'
Iliadis at the University of Maryland' has achieved the most success. He has successfully
developed a room-temperature process which increased the barrier height to 0.83 V. However,
his process left several questions unanswered.

In preparing his samples for exposure, Iliadis used HCl as an etch. Since HCl corrosively
etches InP, the question remained whether the use of HCl represented a critical step in the
enhancement process or whether a more benign etch such as HCl:H3PO4 could be used.

The design of Iliadis's apparatus allowed him to vary only the UV radiation exposure time.
Therefore, questions remained concerning the influence of other parameters such as growth
temperature and radiation intensity, and whether the ozone producing wavelengths are critical
to the process.

Finally, the question remained as to whether this process influences the series resistance of
the device. We felt that the series resistance variations would provide a clue to the true
Schottky nature of these devices. If the variations indicate that the devices are MES Schottky
diodes, then this would lend support to the notion that unpinning of the Fermi level is
occurring at the semiconductor interface. If so, this feature would provide device designers
some flexibility in choosing metals that might produce even higher Schottky barrier heights.
However, if the series resistance increases at a much greater rate than expected, this would
suggest not only that the devices are not MES Schottky diodes, but also that the devices would
exhibit lower microwave cutoff frequencies because of a larger RC time constant.

To answer these questions, we constructed a special apparatus to provide some flexibility in
growth parameters and made additional current and voltage measurements to assess the true
nature of this barrier enhancement.

Preparation  of  Schottky  Diodes 

Apparatus — A custom built work station provided flexibility in growth temperature, gas
composition and flow, light intensity, and wavelength (Fig. 1). Temperature control equipment
consisted of a collection of Research Inc. equipment: process controller, setpoint programmer,
and phase angle controller. Additional signal conditioning equipment converted the signal from
the Research Inc. equipment to the low-voltage high-current signal needed to drive the heating
element located inside the process chamber.

A Corso-Gray Model D-104-B-B/SS gas handling system provided control of the gas composition
and flow. The key system components included Brooks rotameters which provided a maximum flow
of 4.6 lpm for O2 and N2 and a maximum flow of 1.7 lpm for H2.

An Oriel Solar Simulator supplied the UV radiation. The simulator could accept lamps of three
different characteristics. Although a variety of lamps can be installed in the Solar
Simulator,'1' for all experiments in this paper we used the 1000 W Hg-Xe lamp. In addition, by
replacing the normal mirror in the Solar Simulator with a dichroic mirror, we were able to
isolate the ozone producing wavelengths from 200 to 260 nm.

We designed the process chamber to handle samples up to 1 in. by 1 in. in size. A graphite
boat inside the process chamber determined this upper limit in size. The graphite boat
possessed a 1/6 in. tall, 0.1 in. wide lip around the edge to prevent the sample from sliding
from the graphite boat during exposure. Instead of heating by RF induction, we used a
resistive Nichrome metal platform to heat the graphite boat by conduction. The Nichrome
heating platform was 1.6 in. by 2.0 in. and 0.005 in. thick and possessed supports which held
the platform and graphite boat 3.8 in. above the entrance and exit ports for the gases. This
unusual positioning of the heating platform and gas ports might be primarily responsible for
the unusual flow rate effects that are discussed later in this paper.

Sample preparation — Sample preparation began with a degrease procedure which involved the
use of the soap solution FL-70 and a DI water rinse, then an acetone rinse and ultrasonic
bath, followed by a methanol rinse and ultrasonic bath. After another DI water rinse, the next
step in sample preparation involved a pre-etch, to remove surface oxides, using
H2SO4:H2O2:H2O (5:1:1). After a third DI water rinse, an InP etch using HCl or an HCl and
H3PO4 mixture was then performed; the Results section discusses the advantage of one solution
over another. After a final DI water rinse, the samples were blown dry with nitrogen.

Fabrication of diodes — We fabricated Schottky diodes on 1 cm by 1 cm samples cleaved from
undoped InP substrates with a carrier concentration in the mid 10!' cm^-3 range. After
performing the wet chemical procedure described above, the samples were placed into the
process chamber for irradiation under different growth conditions. The samples were held in
place with a perforated metal mask; each perforation allowed for the deposition of dots which
were 127 μm in diameter. This size proved practical for two reasons. The dots were small
enough to keep the measured current low enough to prevent the saturation of the measurement
equipment. Nevertheless, the dots were still large enough to make the alignment of the
measurement probes easy.

Mounting of samples — After deposition, the samples were mounted on a 3 in. by 3 in. by 1/8
in. copper plate; the top surface of the plate was coated with indium to allow for ohmic
contacts to the back side of the InP samples. The actual mounting first involved heating the
copper plate just enough to melt the indium but not hot enough to produce thermal damage on
the InP samples. This requirement was met by using a hot plate set so as not to exceed 250 °C.
After the indium became molten and the samples were mounted, the plate was removed from the
hot plate and placed on a copper heatsink for cooling.

Determination of Barrier Height from
Current/Voltage Measurements

We calculated the barrier height from the measured value of the saturation current and assumed
values of temperature, diode area, and Richardson constant. The saturation current became an
ideal parameter for detecting barrier enhancement. As we will show later, the barrier
enhancement we expected required an orders of magnitude decrease in the saturation current.
Such an expected dramatic change in the saturation current gave us confidence in using the
HP4145A semiconductor parameter analyzer to obtain the saturation current.

The HP4145A semiconductor parameter analyzer — Several features made the HP4145A very useful
for doing current/voltage measurements. The first important feature was the four programmable
units; only two were required to characterize the diodes in these experiments. Each unit could
be programmed to provide or measure voltage from 0 to 100 V and current from 1 pA to 100 mA.
An extremely useful feature was the HP4145's ability to create parameters (called "user
defined functions") that are constructed from mathematical operators and voltage and current
variables. One HP4145 operator that we found particularly helpful was the Δ operator, which
allowed the construction of a parameter that became useful in determining the series
resistance; we discuss this parameter, which was based on the measurement of the differential
current and differential voltage, in the next subsection. We were also able to obtain linear
and logarithmic plots of the user-defined functions by using the graphics routine on the
HP4145A. The graphics package also contained a very useful straight line curve-fitting routine
that allowed us to obtain the series resistance, ideality factor, and saturation current with
few computations. Lastly, the most convenient feature of the HP4145 was the ability to store
measurement configurations which eliminated the need to reprogram the HP4145 for each
measurement. For this experiment, three programs were developed: one program to perform a
typical forward biased I vs. V measurement from 0 to 0.25 V, another program to extract the
series resistance from these measurements, and a third program to correct the I vs. V data in
order to obtain the saturation current. Since the last two programs use the HP4145 in an
unusual manner, the salient features of these programs are discussed in the next two
subsections.

Using  the  HP4145  to  determine  series  resistance  —The 
theoretical  basis  for  this  procedure  began  with  an  exten¬ 
sion  to  the  thermionic  emission  model  for  Schottky  diodes 
to  include  the  influence  of  series  resistance.  Eq  1 

f  =  /„[e  *r  _  1]  (U 

Equation  2  below  demonstrates  the  relationship  between 
the  saturation  current  to  the  barrier  height 

/„  =  A,mT1Ae^  [21 
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f  ig.  2.  HP4I45  plot  used  to  obtain  tenet  resistance.  Region  be¬ 
tween  crosses  marks  the  range  of  measured  data  used  to  perform  the 
linear  curve  fit. 


demonstrates  how  well  the  model  matched  data  from  an 
actual  diode  The  excellent  fit  to  a  straight  line  made  the 
determination  of  the  series  resistance  easy  For  the  exam¬ 
ple,  in  Fig  2.  the  X  intercept  from  the  curve  fiu.'c  routine 
gave  a  series  resistance  of  17  fl  The  inverse  slope.  1/GRAD, 
related  to  an  ideality  factor  of  1  09.  which  we  obtained  by 
dividing  1/GRAD  by  the  room  temperature  value  of  KT/q 
(25  8E-03) 

U  mg  the  HP4 145  to  determine  saturation  current  —Be¬ 
fore  extracting  the  saturation  current  from  the  measure¬ 
ments.  we  programmed  the  HP4145A  to  remove  the  effects 
of  senes  resistance  The  user -defined  (unction  feature  al¬ 
lowed  us  to  subtract  1  R ,  from  the  measured  voltage  (Eq  5) 

VD=  V-/  R,  [5] 

With  this  adjustment,  a  slightly  modified  version  of  Eq  3 
resulted  (Eq  6) 

q'r 

I  =  /0e"KT  (6[ 


We  thought  a  measurement  scheme  for  the  series  resist¬ 
ance  that  was  independent  of  the  barrier  height  would  be 
beneficial  for  our  analysis  Hence  we  developed  a  method  to 
eliminate  the  saturation  current  from  consideration  To  be¬ 
gin,  we  restricted  our  analysis  to  the  forward-biased  region 
of  the  diode  since,  except  for  small  applied  voltages  (i.e,, 
below  60  mV),  the  exponential  term  would  dominate  the  -1 
term  in  Eq  1.  Therefore,  in  this  region,  Eq  lean  be  written 
as 


nrv-iiv 
1  =  I„e  "KT 


(3) 


Since  the  barrier  height  is  contained  in  the  expression  for 
the  saturation  current  (Eq  2)  and  is  independent  of 
voltage,  we  were  able  to  eliminate  the  barrier  height  from 
consideration  and  retain  the  series  resistance  by  taking  the 
natural  logarithm  of  Eq  3  and  then  differentiating  the  re¬ 
sult  with  respect  to  /  and  V.  After  some  algebraic  manipu¬ 
lation,  Eq.  4  below  resulted 


J  ~  r\KT  l  dl 


[4! 


Equation  4  is  not  only  independent  of  the  barrier  height  but 
also  matches  the  straight  line  equation  y  =  m(x  -  b ).  where 
y  equals  1  //,  x  equals  dV/dl,  m  relates  to  the  ideality  factor, 
and  most  important,  b,  the  X  intercept,  gives  the  series 
resistance.  Equation  4  was  programmed  into  the  HP4145 
using  the  user-defined  function  feature  to  define  the  x  and 
y  variables.  In  particular,  the  definition  of  the  x  variable 
made  use  of  the  A  operator  to  obtain  dV  and  dl  Figure  2 


Although this change is small, a plot of log(I) vs. V_D
demonstrates a significant benefit (Fig. 3). While a plot not
corrected for series resistance would maintain some curvature
throughout the forward-biased region, the corrected
plot shown in Fig. 3 exhibited straight-line behavior above
60 mV. Below 60 mV, the exponential term no longer dominates
the diode characteristics; instead, the more precise
Eq. 1 describes the performance in this region. The logarithm
of Eq. 6 not only predicts the linear behavior but also
provides a means of determining the saturation current
(Eq. 7).

Fig. 3. HP4145 plot used to obtain saturation current. Region
between crosses marks the range of measured data used to perform the
linear curve fit.


\[ \log(I) = \log(I_s) + 0.4343\,\frac{q}{\eta K T}\,V_D \qquad (7) \]

Equation 7 matches a slightly different linear equation
than the one used to obtain the series resistance. Rather
than matching the form y = m(x - b), Eq. 7 matches the
form y = mx + b, where y equals log(I), x equals V_D, m
relates to the ideality factor, and, most important, b, the Y
intercept, relates to the saturation current; the saturation
current is actually the antilog of the Y intercept. For the
example in Fig. 3, which has a series resistance of 17 Ω, a
saturation current of 0.896 μA is obtained.
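The two straight-line fits described above can be tried out on synthetic data. The following is a minimal sketch, not the HP4145 procedure itself: it assumes the forward-bias diode model I = I_s exp[q(V - I R_s)/(ηKT)], a measurement temperature of 300 K, and an ideality factor of 1.1 chosen purely for illustration. The function names are hypothetical, and central differences stand in for the analyzer's Δ operator.

```python
import math

K = 1.380649e-23      # Boltzmann constant, J/K
Q = 1.602176634e-19   # electronic charge, C
T = 300.0             # assumed measurement temperature, K


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def synthetic_iv(i_s, eta, r_s, currents):
    """Terminal voltages from the assumed diode model, solved for V."""
    vt = eta * K * T / Q
    return [i * r_s + vt * math.log(i / i_s) for i in currents]


def extract_parameters(currents, voltages):
    """Return (series resistance, ideality factor, saturation current)."""
    # Step 1: y = 1/I against x = dV/dI is a line y = m (x - b).
    xs, ys = [], []
    for k in range(1, len(currents) - 1):
        dv = voltages[k + 1] - voltages[k - 1]
        di = currents[k + 1] - currents[k - 1]
        xs.append(dv / di)
        ys.append(1.0 / currents[k])
    m, c = linear_fit(xs, ys)
    r_s = -c / m                 # X intercept b gives the series resistance
    eta = Q / (m * K * T)        # slope m = q / (eta K T)

    # Step 2: log10(I) against the corrected diode voltage V_D = V - I R_s.
    v_d = [v - i * r_s for i, v in zip(currents, voltages)]
    logs = [math.log10(i) for i in currents]
    _, intercept = linear_fit(v_d, logs)
    i_s = 10.0 ** intercept      # antilog of the Y intercept
    return r_s, eta, i_s


# Geometrically spaced forward currents from 10 uA to about 10 mA.
currents = [1e-5 * 1.02 ** k for k in range(350)]
voltages = synthetic_iv(i_s=0.896e-6, eta=1.1, r_s=17.0, currents=currents)
r_s, eta, i_s = extract_parameters(currents, voltages)
print(f"R_s = {r_s:.2f} ohm, eta = {eta:.3f}, I_s = {i_s:.3e} A")
```

Fitting the resistance first and only then correcting the voltage mirrors the order used in the text: the second fit is linear only after the I R_s drop has been removed.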

Computation of barrier height. Once the saturation
current was known, a rearranged version of Eq. 2 was used
to obtain the barrier height (Eq. 8).


\[ \phi_B = \frac{K T}{q}\,\ln\!\left(\frac{A\,A^{**}\,T^{2}}{I_s}\right) \qquad (8) \]


The actual computation for the barrier height was performed
using the spreadsheet Excel for the Macintosh. For
the value of the saturation current obtained earlier, a typical
InP barrier height of 0.48 V was obtained.
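The spreadsheet step can be reproduced in a few lines. This is a sketch under stated assumptions: it uses the standard thermionic-emission relation phi_B = (KT/q) ln(A A** T^2 / I_s), a temperature of 300 K (not stated in this section), the Richardson constant of 9.2 A/cm^2 K^2 from the symbol list, and a diode area read as 1.27 x 10^-4 cm^2 (the exponent is partly illegible in this copy, so that reading is an assumption). The function name is hypothetical.

```python
import math

K = 1.380649e-23      # Boltzmann constant, J/K
Q = 1.602176634e-19   # electronic charge, C


def barrier_height(i_s, area_cm2, richardson, temp_k):
    """Barrier height in volts from the saturation current (A)."""
    return (K * temp_k / Q) * math.log(area_cm2 * richardson * temp_k ** 2 / i_s)


# Saturation current from the Fig. 3 example; area and temperature are assumed.
phi_b = barrier_height(i_s=0.896e-6, area_cm2=1.27e-4, richardson=9.2, temp_k=300.0)
print(f"barrier height = {phi_b:.2f} V")
```

With these inputs the result rounds to the 0.48 V quoted in the text, which supports the assumed area and temperature but does not prove them.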


Results  and  Discussion 

Before starting this work, a barrier enhancement of 0.8 V
was established as a target. With this in mind, we used
Eq. 8 to estimate the change in the saturation current to
obtain this level of barrier enhancement. The result of this
analysis  predicted  a  drop  of  10!  This  expected  large  drop 
reinforces the benefit of using the saturation current as the
indicator of barrier enhancement.
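The sensitivity of the saturation current to the barrier height can be illustrated numerically. The sketch below assumes the thermionic-emission dependence I_s ∝ exp(-q phi_B / KT) and a temperature of 300 K. The barrier changes in the loop are illustrative values only and are not taken from the paper's own estimate.

```python
import math

K = 1.380649e-23      # Boltzmann constant, J/K
Q = 1.602176634e-19   # electronic charge, C


def saturation_current_ratio(delta_phi_v, temp_k=300.0):
    """I_s(enhanced) / I_s(normal) for a barrier increase of delta_phi_v volts."""
    return math.exp(-Q * delta_phi_v / (K * temp_k))


# Illustrative barrier increases (assumed values, in volts).
for delta in (0.1, 0.2, 0.3):
    print(f"+{delta:.1f} V -> I_s falls by a factor of {1.0 / saturation_current_ratio(delta):.3g}")
```

Each additional 0.1 V of barrier lowers the saturation current by the same multiplicative factor, which is why even modest enhancement shows up clearly in I_s.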

To demonstrate to ourselves that the UV radiation might
have a dramatic effect on an InP surface, we performed a
series of experiments at a relatively high temperature. InP
substrates were exposed to a 700°C environment for 2 min
in three different atmospheres: H2, N2, and O2. For comparison,
some samples were additionally exposed to radiation
from a 1 kW Hg-Xe ozone-free lamp. Since the ambient
temperature is much higher than the congruent sublimation
temperature of InP, we expected severe erosion of the InP
surface regardless of the ambient. However, due to the
short exposure time, only samples exposed to the H2 ambient
showed any noticeable erosion. Samples exposed to the
N2 ambient, regardless of UV exposure, showed no surface
damage, while of the samples exposed to the O2 ambient,
the sample exposed to UV radiation exhibited significant
oxide growth. Possible explanations for these observations
are discussed in the next few paragraphs.

Figure 4 shows the results of a sample exposed to an H2
ambient without UV radiation. The severe erosion was expected,
but, as supported by data presented later, this erosion
was not caused by the sublimation of phosphorus as
we initially thought, but more likely by a reaction
between the phosphorus just above the substrate and H2.
As shown with Eq. 9 and 10, this reaction most likely leads
to formation of phosphine. As the PH3 in Eq. 10 is swept
from the process chamber, Eq. 9 will continue in the forward
direction, thus siphoning more phosphorus from the
InP substrate and leaving indium droplets, shown in Fig. 6,
on the surface.


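\[ \mathrm{InP(s)} \rightleftharpoons \mathrm{In(s)} + \tfrac{1}{4}\,\mathrm{P_4(g)} \qquad (9) \]

\[ \tfrac{1}{4}\,\mathrm{P_4(g)} + \tfrac{3}{2}\,\mathrm{H_2(g)} \rightarrow \mathrm{PH_3(g)} \qquad (10) \]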


Fig. 4. H2 without UV radiation. Exposed for 2 min at 700°C.
Nomarski photomicrograph taken at 50 times magnification. Exposure
produced expected deterioration of the surface.


Figure 5 illustrates the outcome of another sample exposed
to the same conditions as the first sample except for
the addition of UV radiation from a 1 kW Hg-Xe ozone-free
lamp. The radiation seems to make the erosion of the surface
more uniform. This led us to believe that the UV radiation
contributed significantly to the surface reactions
occurring on the InP substrate.
We next examined the behavior of the InP in an N2 atmosphere.
If indeed H2 played little role in the thermal erosion
of the InP surface, we would observe the same damage as
that illustrated in Fig. 4 and 5. Surprisingly, as Fig. 6 and
7 show, no erosion occurred for a sample placed in an N2
ambient regardless of the presence of radiation.
Hence these observations support the notion that, while N2
acts as an inert gas to InP, H2 plays a very active role in the
thermal erosion of InP. This reasoning lends support
to the observations of Greene and Tr…

Next we investigated the behavior of the InP in an
O2 atmosphere. A comparison of Fig. 8 and 9 led to an
unusual observation. Without the use of UV radiation, as
shown in Fig. 8, the surface is well preserved, suggesting
that for these short exposure times O2 acts more like an
inert gas. However, with the addition of the UV radiation
(Fig. 9), a thin film is produced on the surface which Auger
measurements revealed as composed of In and O (Fig. 10),
thus suggesting the formation of the oxide In2O3. This scenario
is possible since the phosphorus liberated is probably
too volatile to participate in any surface reactions. Using
the index of refraction for In2O3, ellipsometer measurements
were made to estimate the thickness at 500 Å. Since
the ozone-free UV lamp was used for this process, the reaction
of ozone with the InP surface cannot be responsible for
this reaction. Instead, in addition to the catalyst role noticed
in the reaction with H2, a photoinduced electron
transfer (PET) process is more likely responsible for producing
O2^- molecules which then participate in the oxidation
of the surface to produce indium oxide.

Fig. 6. N2 without UV radiation. Exposed for 2 min at 700°C.
Nomarski photomicrograph taken at 50 times magnification.
Preservation of surface was a surprise.

Fig. 7. N2 with UV radiation. Exposed for 2 min at 700°C.
Nomarski photomicrograph taken at 50 times magnification.
Preservation of surface was a surprise.

Fig. 8. O2 without UV radiation. Exposed for 2 min at 700°C.
Nomarski photomicrograph taken at 50 times magnification.
Preservation of surface was a surprise.

We believe that there are two possible PET reactions that
result in excess electrons that convert the InP surface into
an oxidizing agent which, due to the relatively high electron
affinity of O2 (0.45 eV), reacts easily with oxygen to eventually
produce an indium oxide film.
The first possible PET reaction, described by Channon
and Eberson,13 relies on the generation of excess electrons
in the conduction band of the n-type material by photostimulation
of electrons from the valence band into the conduction
band, resulting in band bending at the surface to reflect the
increased concentration of electrons and holes. Equation 11
illustrates how the activated semiconductor surface (S.C.)
converts O2 to O2^-.

\[ \mathrm{S.C.} + h\nu \rightarrow \mathrm{S.C.}^{*} + \mathrm{O_2} \rightarrow \mathrm{S.C.}^{+} + \mathrm{O_2^{-}} \qquad (11) \]




Fig. 9. O2 with UV radiation. Exposed for 2 min at 700°C.
Nomarski photomicrograph taken at 50 times magnification. Most
surprising was the formation of a thin brownish film, indicated by the
obvious change in shade just before the blemishes on the surface,
which covered most of the surface.


The second possible reaction, described by Fox14 and illustrated
in Eq. 12, generates excess electrons by photostimulation
of surface atoms (S.A.).

\[ \mathrm{S.A.} + h\nu \rightarrow \mathrm{S.A.}^{*} + \mathrm{O_2} \rightarrow \mathrm{S.A.}^{+} + \mathrm{O_2^{-}} \qquad (12) \]

With indium as the surface atom for the second type of PET
reaction, Eq. 13 and 14 illustrate the possible reaction and
the activation energies based on electrode potential energies.

\[ \mathrm{In} \rightarrow \mathrm{In^{+}} + e^{-}, \quad E_a = 0.14\ \mathrm{eV} \qquad (13) \]
\[ e^{-} + \mathrm{O_2} \rightarrow \mathrm{O_2^{-}} \]

\[ \mathrm{In} \rightarrow \mathrm{In^{3+}} + 3e^{-}, \quad E_a = 0.34\ \mathrm{eV} \qquad (14) \]
\[ 3e^{-} + 3\,\mathrm{O_2} \rightarrow 3\,\mathrm{O_2^{-}} \]

Fig. 10. Auger depth profile (signal vs. sputter time in minutes).
In, P, and O are relatively strong signals. The C signal is within
the background noise level of the instrument.

Because the energy for UV photons is above 3.0 eV and the
energy gap of InP is 1.35 eV, both mechanisms and both
ionization reactions most likely occur to produce the radical
O2^-.
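The energy comparison above can be restated in terms of wavelength with the relation E = hc/λ. The following sketch uses hc ≈ 1239.84 eV nm; the 254 nm line in the example is an assumed illustration of a UV wavelength and is not a value reported in this paper.

```python
H_C_EV_NM = 1239.84   # Planck constant times speed of light, eV*nm


def photon_energy_ev(wavelength_nm):
    """Photon energy in eV for a wavelength given in nm."""
    return H_C_EV_NM / wavelength_nm


def cutoff_wavelength_nm(energy_ev):
    """Longest wavelength whose photons still carry energy_ev."""
    return H_C_EV_NM / energy_ev


INP_GAP_EV = 1.35     # energy gap of InP quoted in the text

print(f"InP gap corresponds to {cutoff_wavelength_nm(INP_GAP_EV):.0f} nm")
print(f"3.0 eV corresponds to {cutoff_wavelength_nm(3.0):.0f} nm")
# Assumed example wavelength: any photon above the gap can excite carriers.
print(photon_energy_ev(254.0) > INP_GAP_EV)
```

Photons with wavelengths shorter than roughly 918 nm exceed the InP gap, so the UV portion of the lamp output clears it by a wide margin.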

Once the radical species is produced, one possible reaction
to produce indium oxide is shown in Eq. 15.

\[ \mathrm{In^{+}} + \mathrm{O_2^{-}} \rightarrow \tfrac{1}{2}\,\mathrm{In_2O_3} + \tfrac{1}{4}\,\mathrm{O_2} \qquad (15) \]

Hence, thermodynamic arguments can be made to support
the claim that ozone is not primarily responsible for
the layer formation; instead, the layer forms by oxidation of
a surface enhanced by the photoinduced transfer of electrons.
After the high-temperature experiment, we attempted to
reproduce the room-temperature Schottky enhancement
achieved by Iliadis. However, we met with little success.
Less than 1% of the samples processed demonstrated any
evidence of barrier enhancement. Furthermore, for those
samples exhibiting enhancement, the effect was very localized;
at most three diodes from an array of 144 exhibited
any enhancement. At this point, the idea of optimizing the
process at the congruent sublimation temperature became
more appealing. In pursuing this line of investigation, the
importance of other parameters, which were needed to
achieve the greatest possible enhancement, became clear.
In presenting our results, several effects are presented in
the same table. This was not done to suggest any connection
between the parameters, although future studies might reveal
such a relationship, but instead to present the results in
a more compact form and demonstrate how the experiments
proceeded chronologically.

The first set of experiments involved analysis of temperature
and wavelength (Table I). Since the dichroic mirror
was still in place from the room-temperature experiments,
the first three experiments only used the ozone-producing
wavelengths. Three different temperatures were examined,
with no enhancement occurring. Next, the dichroic mirror
was replaced with the Solar Simulator's normal mirror,
and the temperature ranges were repeated. This time enhancement
occurred at 368°C. The results of these experiments
gave the first indication that the congruent sublimation
temperature was a critical parameter for this process. The
results also indicated that the ozone-producing wavelengths
had little influence on the enhancement process.

Table I. Barrier height (V) versus temperature and wavelength.

Wavelength    204-211°C    368°C     628°C
Ozone         0.47         0.47      0.47
Full          0.47         0.55      0.47

Sample preparation: HCl etch. Growth parameters: flow rate =
1540 sccm, intensity = 98 mW/cm^2, exposure time = 2 min.

The sensitivity of this process to the congruent sublimation
temperature appears to make sense. Since the enhanced
layer most likely contains insulating phosphorus
oxides, either InPO4 or P2O5 or both, and because the substrate
is the only source of phosphorus, the ability to trap
the phosphorus being liberated from the substrate into the
enhancement layer becomes important. In contrast to its
high volatility at elevated temperatures, the phosphorus
liberated at the congruent sublimation temperature should
be less mobile, thus allowing the phosphorus to interact
with the O2^- radical produced by a photoinduced electron
transfer (PET) effect. The observed PET effect at this temperature
is consistent with the high-temperature results
discussed earlier in this paper. Equations 16 and 17 describe
the initial reactions which might involve the radical
O2^-. The surface atom (S.A.) is most likely indium. As a
result, Eq. 18, 19, and 20 describe possible terminal reactions
which produce the surface oxides. Note that the ionized
surface sites may play an important role in attracting
the phosphorus oxides back to the surface, since electrostatic
attraction between surface sites and the gas
molecules may be strong enough to prevent phosphorus
oxides from being swept from the surface.


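\[ \mathrm{S.C.} + h\nu \rightarrow \mathrm{S.C.}^{*} + \mathrm{O_2} \rightarrow \mathrm{S.C.}^{+} + \mathrm{O_2^{-}} \qquad (16) \]

\[ \mathrm{S.A.} + h\nu \rightarrow \mathrm{S.A.}^{*} + \mathrm{O_2} \rightarrow \mathrm{S.A.}^{+} + \mathrm{O_2^{-}} \qquad (17) \]

\[ \mathrm{In^{+}} + \mathrm{O_2^{-}} \rightarrow \tfrac{1}{2}\,\mathrm{In_2O_3} + \tfrac{1}{4}\,\mathrm{O_2} \qquad (18) \]

\[ \mathrm{In^{3+}(s)} + \mathrm{In^{+}(s)} + \mathrm{P_2(g)} + 4\,\mathrm{O_2^{-}(g)} \rightarrow 2\,\mathrm{InPO_4(s)} \qquad (19) \]

\[ 5\,\mathrm{S.C.}^{+}\mathrm{(s)} + \mathrm{P_4(g)} + 5\,\mathrm{O_2^{-}(g)} \rightarrow 2\,\mathrm{P_2O_5(s)} + 5\,\mathrm{S.C.} \qquad (20) \]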

After the success of the first set of experiments, we next
addressed the questions of whether the etch solution HCl
was critical to the process and whether there was an upper
limit to the exposure time. As mentioned in the previous
chapter, HCl corrosively etches InP. From a device processing
point of view, the use of a more benign and controllable
etch becomes important. The mixture of HCl and H3PO4
appeared the best choice, since this mixture not only produces
an excellent morphology but also exhibits an adjustable
etch rate based on the portion of H3PO4. By adding
larger portions of H3PO4 to the mixture, the etch rate could
be decreased from 12 to 0.5 μm/min. For our experiments,
a 1:4 (HCl:H3PO4) mixture was used, which produced a
modest etch rate of 1 μm/min. An investigation of the effects
of exposure time was conducted since Iliadis demonstrated
in his work that the barrier height saturated at an
exposure time of 40 min. Since our experiments used a
higher intensity (100 mW/cm^2 vs. 15 mW/cm^2), we expected
to observe a similar effect within 6 min of exposure time.
Table II summarizes the results of the second set of experiments,
which demonstrate that the HCl:H3PO4 etch is a
suitable substitute for HCl and that the barrier height saturates
at 5 min. The barrier-height saturation time
seems to be consistent with Iliadis's results, thus suggesting
that an energy density limit might exist for this enhancement
process. We believe this is most likely linked to the
penetration depth of the UV radiation into the substrate
during the PET process. Since the InP extinction coefficient
in the UV range is at least an order of magnitude greater
than the extinction coefficient in any other wavelength region
emitted from the Solar Simulator, UV-driven reactions
would be limited to the few monolayers close to the surface.
Therefore, the number of activated sites in Iliadis's and our
experiment would be approximately the same; hence, the
shorter saturation time we observed would be consistent
with the higher irradiance available with our apparatus.
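The energy-density argument amounts to a simple dose calculation, dose = intensity x time. The sketch below works through it with the figures quoted in the text (15 mW/cm^2 for 40 min, and 98 to 100 mW/cm^2 for about 5 to 6 min). It assumes that intensity is constant during exposure and that saturation depends only on the delivered dose; the function names are hypothetical.

```python
def dose_j_per_cm2(intensity_mw_cm2, minutes):
    """Energy density delivered at constant intensity, in J/cm^2."""
    return intensity_mw_cm2 * 1e-3 * minutes * 60.0


def equivalent_time_min(ref_intensity, ref_minutes, new_intensity):
    """Exposure time giving the same dose at a different intensity."""
    return ref_minutes * ref_intensity / new_intensity


# Reference saturation: 40 min at 15 mW/cm^2, rescaled to 100 mW/cm^2.
print(equivalent_time_min(15.0, 40.0, 100.0))   # minutes
print(dose_j_per_cm2(15.0, 40.0))               # reference dose
print(dose_j_per_cm2(98.0, 5.0))                # dose at 98 mW/cm^2 for 5 min
```

The two doses come out at 36 and about 29 J/cm^2, close enough to be compatible with a common energy-density limit given the coarse time steps in Table II.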
Having identified a suitable etch solution and exposure
time, we returned to refining the temperature effect.
Table III illustrates the results of these experiments.
Significant barrier enhancement occurs from 350 to
380°C. These results reaffirm the dynamics of the process
discussed earlier. In addition, the highest barrier height,
0.69 V, gave indications of a flow-rate effect, since that particular
experiment was performed at a lower flow rate
than the other samples, 614 sccm.


Table II. Etch solution and time effects.

Temp (°C)    Etch          Time (min)    Barrier height (V)
368          HCl           2             0.55
366          HCl:H3PO4     5             0.59
366          HCl:H3PO4     30            0.60

Growth parameters: flow rate = 1540 sccm, intensity = 98 mW/cm^2,
wavelength = full spectrum.


Table III. Refinement of temperature effect.

Temp (°C)    Barrier height (V)
394          0.47
380          0.59
366          0.59
353          0.51
348          0.69

Sample preparation: etch HCl:H3PO4. Growth parameters: intensity
= 98 mW/cm^2, exposure time = 5 min, wavelength = full spectrum,
flow rate = 1540 sccm (last sample in table processed at flow
rate of 614 sccm).


Table IV. Refinement of flow rate effect.

Flow (sccm)    Percent full scale    Barrier height (V)
1540           33.3                  0.51
920            20.1                  0.50
614            13.3                  0.67
301            6.7                   0.48
0              0.0                   0.49

Sample preparation: etch HCl:H3PO4. Growth parameters: intensity
= 98 mW/cm^2, exposure time = 5 min, wavelength = full spectrum,
temperature = 348°C.

The results at 348°C prompted another set of experiments
(Table IV) to isolate the proper flow rate. The earlier
conditions were repeated and produced a barrier height of
0.67 V. Other experiments were performed at flows higher
and lower than 614 sccm. However, none of these experiments
resulted in a barrier height greater than that
achieved at 614 sccm. We believe one of two reasons could
explain this unusual result. One reason could be
that the kinetics of the process may require the contribution
of unreacted O2 at a particular speed to enhance an
intermediate chemical reaction. Another reason, as discussed
earlier, could be an unusual flow pattern influenced
by the layout of the process chamber. Further study is
needed to identify a plausible reason.

The sample with the highest barrier in the previous set of
experiments was also studied to determine whether the
barrier height remained stable in air. Table V illustrates
the results of these measurements. Unfortunately, we
measured a decay of the barrier height to 0.57 V within
48 h, after which the barrier decayed to 0.55 V in 45 days.
We believe this decay was caused by the reaction of P2O5
with water vapor in the atmosphere, since this oxide is one
of the most efficient drying agents used in desiccants
(0.5 grams of water removed per gram of P2O5). To solve
this problem, a process to encapsulate and/or anneal these
devices will be necessary to guarantee their long-term stability.
Thus what initially appeared as a simple process to
enhance InP's Schottky barrier is becoming more complex.

We thought a close examination of the changes in the
series resistance and ideality factor would reveal whether
these devices are MES Schottky diodes. For MES Schottky
diodes, we believe that, at best, the series resistance should
remain constant as the barrier height increases, and, at
worst, the resistance should relate to the length of the depletion
region generated by the Schottky barrier. The theoretical
model we used for the latter case views the depletion
region as a linear resistor. From this standpoint, we derived
Eq. 21.


£ 
The approximation in Eq. 21 reflects the fact that the
expression ignores the few millielectron volts difference
between the bottom of the conduction band and the Fermi
level within the neutral region of the semiconductor material.
These differences are typically an order of magnitude
less than the barrier height and as such have little influence
on the analysis of the experimental results.
Since the resistivity of the depletion region is difficult to
obtain, we developed an expression for resistance which
eliminated the resistivity by considering the percent increase
in the enhanced resistance from the resistance of a
normal Schottky diode (see Eq. 22).

%  increase  =  x  100  (221 


For the size of the Schottky diodes used in this experiment,
the resistance of a normal Schottky diode, which has
a barrier height of 0.45 V, is 21.1 Ω; as a result, the expected
resistance of an enhanced layer is expressed by Eq. 23.
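Because Eq. 21 and 23 are not legible in this copy, the following is only a sketch of the kind of estimate the text describes, not the authors' expression. It assumes the textbook abrupt-junction depletion width, W = sqrt(2 ε φ_B / (q N)), so that a depletion region treated as a linear resistor has R proportional to sqrt(φ_B); the resistivity, area, and doping then cancel in the ratio. The 21.1 Ω and 0.45 V reference values are those quoted above; the function names are hypothetical.

```python
import math

R_NORMAL_OHM = 21.1    # resistance of the normal diode, from the text
PHI_NORMAL_V = 0.45    # barrier height of the normal diode, from the text


def expected_resistance(phi_enhanced_v):
    """Expected depletion-region resistance, assuming R scales as sqrt(phi_B)."""
    return R_NORMAL_OHM * math.sqrt(phi_enhanced_v / PHI_NORMAL_V)


def percent_increase(phi_enhanced_v):
    """Percent increase over the normal diode under the same assumption."""
    return (math.sqrt(phi_enhanced_v / PHI_NORMAL_V) - 1.0) * 100.0


for phi in (0.45, 0.55, 0.59, 0.69):
    print(f"{phi:.2f} V -> {expected_resistance(phi):.1f} ohm "
          f"({percent_increase(phi):.0f}% increase)")
```

Under this assumption even the highest barrier measured here would raise the resistance by only about a quarter, which is the scale against which the measured series resistance can be compared.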
Figure 11 demonstrates how, even for the lower barrier
heights, the measured series resistance varies drastically
from the expected variation. For the higher barrier heights,
the discrepancy is much worse. We believe these results
indicate that a thin insulating barrier is being formed between
the diode metal gate and the semiconducting surface
as the barrier is increased. This would strongly suggest the
formation of a MIS structure rather than a MES Schottky
diode. The evidence becomes more compelling in this direction
if the variation in the ideality factor is also considered.
If the enhancement represented a MES Schottky diode, the
ideality factor would remain close to 1.00. But an examination
of Fig. 12 shows that this is not the case. Instead, the
ideality factor increases as the barrier height increases,
which is an indication that the diode characteristics are
moving further and further away from ideal behavior.
Since the variations in the series resistance and ideality
factor indicate that these devices are not MES-enhanced
Schottky barriers, the expected unpinning of the Fermi
level is not a by-product of our enhancement process. Furthermore,
for gate regions fabricated from this technology,
we would expect lower cutoff frequencies due to larger
than expected RC time constants.

Conclusions 

A number of important observations were made with this
set of experiments. To begin, we established that the HCl
etch is not critical to the enhancement process. As a result,
we were able to replace this etch with the more benign
mixture of HCl:H3PO4. The optimal growth parameters include
growth temperatures in the 340 to 380°C range, an O2
flow rate of approximately 600 sccm, and a saturation
exposure time of 5 min.
We also determined that the ozone-producing wavelengths
are not critical to the process. Finally, the device characteristics
indicate that the barrier height is susceptible to the
water vapor in the air, and the variation of the series resistance
and ideality factor as the barrier height increases
suggests that the devices are not MES Schottky diodes.

Although these devices are not MES Schottky diodes, the
advantage of a lower saturation current would be beneficial
for a number of device applications. For instance, this thin
insulating layer could be used as an intermediate layer between
the InP surface and a SiO2 layer in order to enhance the
stability of SiO2 MIS devices. Therefore, future work will
continue in order to answer the questions raised by the
experiments in this paper and to alleviate the susceptibility of
the devices to water vapor.
LIST OF SYMBOLS

A            area of Schottky diodes, 1.27 x 10^-4 cm^2 for diodes
             used in this paper
A**          Richardson's constant, 9.2 A/cm^2 K^2 for InP
I            current, A
I_s          saturation current, A
K            Boltzmann's constant, 1.38 x 10^-23 J/K
N_D          carrier concentration of substrate, cm^-3
q            electronic charge, 1.602 x 10^-19 C
R_equiv      equivalent resistance of depletion region, Ω
R_s          series resistance, Ω
S.A.         surface atom involved in PET process
S.C.         n-type semiconductor material involved in PET process
T            temperature, K
V            voltage across diode and series resistance, V
V_D          voltage across the diode, V
ε            permittivity, F/cm
η            ideality factor
ρ            resistivity, Ω-cm
φ_B          Schottky barrier height, V
φ_enhanced   barrier height of enhanced Schottky junction, V
φ_normal     barrier height of normal Schottky junction, V
GAS PHASE REACTIONS OF TRIMETHYLAMINE ALANE IN
LOW PRESSURE ORGANOMETALLIC VAPOR PHASE EPITAXY OF AlGaAs

B.L.  Pitts,  D.T.  Emerson  and  J.R.  Shealy 
OMVPE  Facility,  School  of  Electrical  Engineering 
Cornell  University 
Ithaca,  N.Y.  14853 

Abstract 

We have investigated the effects of gas phase reactions between
trimethylamine alane (TMAA), triethylgallium (TEG) and arsine on
Al(x)Ga(1-x)As films grown by low pressure Organometallic Vapor
Phase Epitaxy. The reactor used
in  this  study  provides  for  independent  observation  of  the  effects  of  TEG-TMAA 
and  TMAA-arsine  gas  phase  reactions.  Gas  phase  reactions  involving  TMAA  and 
TEG  result  in  the  formation  of  nonvolatile  compounds  upstream,  which  condense 
on  the  reactor  wall,  resulting  in  a  reduction  of  growth  rate  and  a  degradation  of 
the  deposition  uniformity.  The  TMAA-arsine  reaction  produces  a  compositional 
dependence  on  the  gas  phase  stoichiometry  (V/III  ratio).  Both  of  these  effects 
are  more  severe  for  higher  TMAA  fluxes.  High  quality  AlGaAs  with  excellent 
thickness  and  compositional  uniformity  was  produced  by  spatially  separating 
the  TMAA  and  TEG  in  the  gas  phase  which  minimizes  the  parasitic  reactions. 
The growth of aluminum-containing III-V compound semiconductors
by Organometallic Vapor Phase Epitaxy (OMVPE) has traditionally been plagued
with high oxygen and carbon incorporation. A major reason for these problems
is the widely used aluminum source, trimethylaluminum (TMA). TMA
has a strong aluminum-carbon bond and the ability to form volatile aluminum
alkoxide compounds, resulting in oxygen-contaminated AlGaAs layers.1 Recently,
trimethylamine alane (TMAA) has received much attention as a viable alternative
aluminum source in both OMVPE and Chemical Beam Epitaxy (CBE).2-6 Previous
reports indicate that using TMAA along with triethylgallium (TEG) and
arsine (AsH3), under the appropriate growth conditions (very high V/III ratios
and gas velocities), can result in the highest purity OMVPE-grown AlGaAs.2,6
This is believed to be due to the lack of a direct aluminum-carbon bond in TMAA and
also its ability to form involatile Al-O compounds when reacted with oxygen
and H2O, resulting in reduced oxygen contamination. Improved photoluminescence
(reduced donor-to-acceptor related transition) and mobilities (77 K mobility
exceeding 14,000 cm^2/V sec for Al0.14Ga0.86As) have been achieved.2

Earlier reports using TMAA in OMVPE suggest that a requisite for producing
high quality AlGaAs epitaxial layers is to avoid prereactions between TEG
and TMAA upstream from the susceptor.2,7 TMAA has a low thermal decomposition
temperature (~100 °C), allowing predeposition on the side walls of the
reaction cell. A solution to these problems has been to increase the gas velocity,
which reduces the residence time of the reactants in the growth chamber.
In addition, high V/III ratios are necessary to achieve high purity results. The
growth chemistry using these precursors in OMVPE must be understood in order
to optimize film quality. Although studies investigating the effects of gas-phase
reaction between TMAA and TEG in CBE have been reported,8,9 no previous
study exists for OMVPE.

An investigation of gas phase reactions involving TMAA in low pressure
OMVPE of AlGaAs is reported. We have observed two predominant effects: one,
due to a TMAA-AsH3 reaction, yields a strong dependence of film composition on
the V/III ratio, and the other, resulting from a TMAA-TEG reaction, degrades
the deposition uniformity. The effects of each of these gas phase reactions
in the upstream portion of the reaction cell were identified by spatially separating
TMAA and TEG in the gas phase using a multichamber reaction cell.10 The
TMAA-TEG reaction has severe effects on the quality of the AlGaAs films,
especially at low V/III ratios. Using the separated TMAA and TEG reactant flux
approach, high quality AlGaAs structures were produced at much lower V/III
ratios than have been previously reported.

AlGaAs layers were grown using Flow Modulation Epitaxy (FME)11 at low
pressure in a vertical barrel, multichamber OMVPE system,10 illustrated schematically
in Figure 1a. In this system substrates are rotated through group III rich
spatially separated zones in a uniform group V background flux without valve
switching.  An  inner  quartz  ampoule  (diameter-cO  separates  the  reactant  fluxes  of 
each  deposition  zone.  Figure  lb  shows  the  flow  modulation  exposure  cycle  for 
each  growth  mode.  In  the  conventional  growth  mode,  the  TEG  and  TMAA  are 
premixed  prior  to  injection  into  the  reaction  cell  while  the  susceptor  is  rotated 
at  0.1  rev/sec.  In  the  spatially  separated  growth  mode,  the  TEG  and  TMAA  are 
injected  into  separate  growth  zones,  minimizing  the  TEG-TMAA  gas  phase  reac¬ 
tions.  For  the  group  III  flux  used  in  this  study,  susceptor  rotation  speeds  greater 
than  1  rev/sec  are  needed  to  produce  sub-monolayer  exposure  cycles  which  re¬ 
sult  in  mixed  alloys.  Rotation  speeds  ranging  from  0.1  to  0.7  rev /sec  were  used 
when  the  reactant  fluxes  were  separated  in  the  vapor.  Raman  spectroscopy  con¬ 
firmed  the  existence  of  short  period  superlattices  (confined  LO  GaAs  and  AlAs 
vibrations)  on  all  samples  produced  with  this  method.  The  degree  of  deposition 
zone  separation  (indicated  by  the  set  of  arrows  in  Figure  lb)  is  proportional  to 
the  amount  of  hydrogen  carrier  gas  injected  between  each  zone.  Because  a  small 
amount  of  zone  intermixing  occurs  in  the  spatially  separated  growth  mode  (see 
Figure  lb),  the  short  period  superlattices  have  graded  interfaces.  In  both  growth 
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schemes,  the  total  gas  flow  was  30  slm,  while  the  gas  velocity  was  maintained 
at  30  cm/s. 
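
The link between rotation speed and the thickness deposited per exposure cycle can be made concrete with a short calculation. The sketch below is illustrative only: the growth rate is a hypothetical input (the text does not quote one here), the monolayer thickness is taken as half the room-temperature GaAs lattice constant, and the function name is introduced purely for this example.

```python
# Illustrative sketch: average thickness deposited per susceptor revolution.
# The growth rate passed in is an ASSUMED example value, not a number from the text.

GAAS_LATTICE_CONSTANT_NM = 0.56533            # room-temperature GaAs lattice constant
MONOLAYER_NM = GAAS_LATTICE_CONSTANT_NM / 2   # one (100) monolayer (Ga-As bilayer)


def monolayers_per_revolution(growth_rate_um_per_hr, rev_per_sec):
    """Average number of (100) monolayers grown during one full revolution."""
    nm_per_sec = growth_rate_um_per_hr * 1000.0 / 3600.0
    return nm_per_sec / rev_per_sec / MONOLAYER_NM


if __name__ == "__main__":
    assumed_rate_um_per_hr = 1.0  # hypothetical time-averaged growth rate
    for speed in (0.1, 0.7, 1.0):  # rev/sec, spanning the speeds mentioned above
        ml = monolayers_per_revolution(assumed_rate_um_per_hr, speed)
        print(f"{speed:.1f} rev/s: {ml:.2f} monolayers per revolution")
```

Under this assumed growth rate, slower rotation deposits several monolayers per pass (a short period superlattice), while faster rotation pushes each cycle toward the sub-monolayer regime where the layers mix into an alloy.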

Undoped AlGaAs layers were grown using TMAA, adduct-purified TEG
and 100% AsH3. Layer thicknesses ranged from 0.5 to 2 µm. The substrates, (100)
Si-doped n+ GaAs and (100) semi-insulating GaAs, were first rinsed in organic
solvents and then etched in 5H2SO4:1H2O2:1H2O prior to growth. The TMAA
was held at 23 °C while a H2 flow of 57 sccm was passed through the bubbler.
The TEG was also held at 23 °C, while the flow varied from 18 to 50 sccm. Both
TEG and TMAA were maintained at 100 torr. The growth temperature varied
from 635 to 750 °C, and the reactor cell pressure was 76 torr. AlGaAs films were
characterized by Hall measurements, Raman spectroscopy and photoluminescence
(PL). Low temperature (1 K) PL was carried out with samples submerged
in superfluid He with photoexcitation provided by the 514.5 nm line of an Ar+
laser. Raman spectroscopy was used to determine the aluminum composition12
and the structure features of the superlattices.13 Thickness measurements were
made by a combination of angle bevelling and staining and from analysis of
reflectance spectra.

The V/III ratio and growth temperature criteria for good surface morphology
were investigated over the range from 635 to 750 °C. Good surface morphology
was realized for a V/III ratio as low as unity over the entire temperature
range. All layers were n-type, and net carrier concentrations were in the low
10^15 cm^-3 range. With growth temperature (670 °C) and group III flux constant,
the Al mole fraction as determined by Raman scattering12 was found to vary
with V/III ratio in the conventional premixed growth mode. As shown in Figure 2,
more Al is incorporated in the film as the V/III ratio is decreased. For
low TMAA fluxes, corresponding to alloy compositions of ≈15%, AsH3 appears
to prevent the TEG-TMAA reaction, which is shown to reduce the TEG transport
to the growth surface. As can be seen in the PL spectra in the Figure 2 inset,
the sample quality degrades with decreasing V/III ratio. At a V/III ratio of 80,
a full width at half maximum exciton linewidth of 2.2 meV was observed for
Al0.15Ga0.85As. This compares favorably with the narrowest linewidth ever reported
at that composition by Reynolds et al.14 The exciton line broadened but
was still clearly identifiable when the V/III ratio was lowered to 7.5. Finally,
at a V/III ratio of 1, the exciton feature was absent. The need for large arsine
flows may imply that the TMAA-AsH3 reaction inhibits the TMAA-TEG reaction,
which is demonstrated to severely degrade the quality of the AlGaAs films.

Gas phase reactions between TEG and TMAA have major effects on the
growth rate. As Figure 3 illustrates, when the TMAA and TEG are premixed, the
AlGaAs growth rate is approximately half that of GaAs with the same TEG reactant
flux. A relatively high V/III ratio was used (V/III = 80) to eliminate the effects
of AsH3 flows described earlier. The Al composition for AlGaAs grown using
premixed sources was 79%, whereas that for the spatially separated sources was
nominally 40%. Assuming that the Al and Ga incorporation in the AlGaAs layer
in the two growth modes are equal, an estimated 70% of the TMAA reacts in the
gas phase to produce nonvolatile compounds. The effect of growth rate reduction
was also reported for CBE for premixed TMAA and TEG.8 In addition, color
fringes were observed downstream along the wafer, indicating severe thickness
nonuniformity (±16% over a 20 mm diameter). In contrast, excellent thickness
uniformity was realized (±1% over a 20 mm diameter) when the TEG and TMAA
were separated in the vapor. Although the growth rate was roughly doubled by
separating the group III reactant fluxes, it was still lower than that for GaAs.
This is likely due to the partial intermixing of the growth zones.

A comparison of PL spectra was made between layers grown by premixed
and spatially separated growth modes for constant reactant flux. These experiments
were performed at a growth temperature of 670 °C and a V/III ratio of
80. As shown in Figure 4, the material grown with spatially separated group III
fluxes exhibited three orders of magnitude higher PL intensity than the premixed
grown material. All material grown by spatially separating the TMAA and TEG
had strong room temperature PL, which was difficult to observe when premixed
sources were used. A possible explanation for this effect is that in addition to the
TMAA and TEG forming nonvolatile compounds, volatile compounds are also
present which participate in the growth process and incorporate non-radiative
centers in the AlGaAs.

In conclusion, we have demonstrated the effects of parasitic gas phase reactions
between TMAA, TEG and AsH3 in low pressure OMVPE of AlGaAs.
The TMAA-TEG reaction decreases the growth rate and degrades thickness uniformity
and luminescence efficiency, particularly at moderately high Al compositions.
This reaction results in the formation of nonvolatile compounds, dramatically
reducing the TEG transport to the substrate surface. These effects were
greatly reduced by spatially separating the TMAA and TEG to minimize parasitic
gas phase reactions. The effects of V/III ratio on film quality and Al composition
have also been determined. High V/III ratios are necessary to inhibit the
TMAA-TEG reaction, likely due to a pre-reaction between TMAA and AsH3. The
AsH3 flow requirement for acceptable quality AlGaAs films is sharply reduced
using the multichamber flow modulation technique.

The authors wish to thank B. Butterfield, A. Schremer and K. Whittingham
for technical assistance. This work was supported by the Joint Services Electronics
Program under grant No. F49620-90-C-0039, the Strategic Defense Initiative
Organization under contract No. N00014-89-J-1311, and the Defense Advanced Research
Projects Agency under contract No. MDA97290C0058 Optoelectronics
Technology Center.

FIGURE CAPTIONS

1. Figure 1: (a) Schematic illustration of implementation of Flow Modulation
Epitaxy in the multichamber cell. Two TEG sources, one on each growth
zone, allow for conventional premixed injection or spatially separated group
III sources. The arsine is uniformly injected around the cell. The inner quartz
ampoule (diameter d) serves to separate the reactant fluxes of each deposition
zone. (b) The exposure cycle for premixed and spatially separated TMAA
and TEG. The arsine flow is uniformly distributed around the cell. Dotted
lines represent the reactant flux zero reference. The degree of deposition zone
separation is indicated schematically by the set of arrows in the lower diagram of
the figure.

2. Figure 2: Dependence of the aluminum composition (determined from Raman
scattering) on V/III ratio for constant TEG and TMAA fluxes at 670 °C. The
inset shows the corresponding low temperature PL spectra for various V/III ratios.
The luminescence intensity is magnified by the factors shown. Excitation
conditions are as indicated.

3.  Figure  3:  The  growth  rate  of  undoped  AlGaAs  downstream  along  the  wafer 
when  TMAA  and  TEG  are  premixed  prior  to  injection  into  the  growth  chamber 
and  spatially  separated  in  the  gas  phase.  The  nominal  aluminum  composition 
of the superlattice is 0.40. The growth rate is normalized to GaAs. The experimental
conditions are as indicated.

4. Figure 4: Low temperature (1 K) photoluminescence of undoped AlGaAs
grown with TEG, TMAA and AsH3 using FME. The TMAA and TEG were
either premixed prior to their injection into the reaction cell or spatially separated
in the multichamber cell, as indicated. The luminescence is magnified by
the factors shown. Excitation conditions, growth conditions and superlattice
periods are as indicated.

THE EFFECTS OF GAS PHASE REACTIONS OF TRIMETHYLAMINE ALANE ON
AlGaAs FILMS GROWN BY ORGANOMETALLIC VAPOR PHASE EPITAXY

Abstract

The effect of gas phase reactions between trimethylamine alane, triethylgallium and
arsine on AlGaAs films grown by Organometallic Vapor Phase Epitaxy is reported.
Using a multichamber reaction cell, we have been able to independently
observe the effects of TMAA-TEG and TMAA-arsine gas phase reactions. The effects
of TMAA-TEG gas phase reactions were identified by comparing films that
were grown by premixing the TMAA and TEG prior to injection to those in
which the TMAA and TEG were spatially separated. The TMAA-TEG reaction
results in the formation of nonvolatile compounds which condense upstream
from the reaction cell, resulting in a severe reduction in growth rate, as well as a
depletion of the gallium species. The TMAA-arsine reaction produces a compositional
dependence on V/III ratio. The arsine flow requirement for attaining
good surface morphology has been identified. Under the appropriate growth
conditions we demonstrate that acceptable purity AlGaAs can be grown using
low V/III ratios.

I. Introduction

The ability to produce high purity AlGaAs material has led to the realization
of many high performance optical and high-speed electronic devices. The growth
of AlGaAs by Organometallic Vapor Phase Epitaxy (OMVPE) has traditionally
been plagued by high oxygen and carbon incorporation. Although
many attempts have been made to reduce these effects, relatively high concentrations
of carbon, and to a lesser extent oxygen, still persist using conventional
sources1,2. A major reason for these problems is the widely used aluminum
source, trimethylaluminum (TMA). TMA has a strong aluminum-carbon
bond and the ability to form volatile aluminum alkoxide compounds, resulting
in oxygen contaminated AlGaAs layers3. Triethylaluminum (TEA) is also used
as an aluminum source, and it has demonstrated lower carbon incorporation in
AlGaAs than TMA. Low temperature (<5 K) mobilities near 500,000 cm^2/V sec
have been reported for AlGaAs/GaAs modulation doped heterostructures (sheet
electron density ~8 × 10^11 cm^-2) using TEA and triethylgallium (TEG)4. Comparable
results do not yet exist for structures grown with TMA or trimethylamine
alane (TMAA). However, some residual oxygen still remains using TEA. Also,
TEA has a low vapor pressure (0.5 torr at 55 °C), which is inconvenient for
OMVPE.

Trimethylamine alane (TMAA) has received much attention as a viable aluminum
source in both OMVPE and Chemical Beam Epitaxy (CBE)5-16. TMAA
does not have a direct aluminum-carbon bond, which is expected to reduce the carbon
contamination. Also, when TMAA reacts with O2 and H2O, involatile Al-O compounds
form, thereby reducing the oxygen contamination. Reports indicate that
using TMAA along with TEG or trimethylgallium (TMG) and arsine (AsH3), under
the appropriate growth conditions (very high V/III ratios and gas velocities),
can result in the highest purity OMVPE grown AlGaAs5,7. Improved photoluminescence
(reduced donor-to-acceptor related transition) and mobilities (77 K
mobility exceeding 14,000 cm^2/V sec for Al0.14Ga0.86As) have been achieved5.


Gas Phase Reactions of TMAA - Pitts et al.

Recently, high quality AlInAs/GaInAs structures have also been attained using
TMAA. Low threshold lasers and high transconductance selectively doped field
effect transistors have been demonstrated using TMAA in both AlGaAs/GaAs
and AlInAs/GaInAs material systems11,12,15.

Previous investigators have reported parasitic reactions between TMAA and
metal-alkyl compounds in OMVPE5,12,15. Inferior compositional and thickness
uniformity was realized, probably due to gas phase reactions between TMAA
and other reactants5. Grady et al. performed Fourier transform infrared (FTIR)
spectroscopy on a TMAA/TMG vapor mixture and reported the presence of strong
gas phase reactions between TMAA and TMG resulting in a depletion of gallium
species17. TMAA also has a low thermal decomposition temperature (~100 °C),
allowing predeposition on the side walls of the reaction cell. A remedy to these
problems has been to increase the gas velocity, which reduces the contact time
between the reactants in the growth chamber. Hobson et al. used gas velocities
greater than 1 m/sec to overcome these effects11. In addition, high V/III ratios
are necessary to achieve high purity results. Studies have been made investigating
the growth chemistry of CBE using TMAA with other organometallic
compounds13,14. Notably, Kobayashi et al. reported the effects of gas-phase
reactions between TMAA and TEG in CBE. They concluded that TMAA-TEG reactions
produced non-volatile compounds which decrease the growth rate and
reduce gallium incorporation13.

This  paper  investigates  the  effects  of  gas  phase  reactions  between  TMAA, 
TEG  and  AsH3  on  AlGaAs  films  grown  by  low  pressure  OMVPE.  The  reactor 
used  in  this  study  provides  for  independent  observation  of  the  effects  of  TEG- 
TMAA  and  TMAA-AsH3  gas  phase  reactions.  Gas  phase  reactions  involving 
TMAA  and  TEG  result  in  the  formation  of  nonvolatile  compounds  upstream, 
which  condense  on  the  reactor  wall,  resulting  in  a  reduction  of  growth  rate  and 
a  degradation  of  the  deposition  uniformity.  The  TMAA-TEG  reaction  has  severe 
effects on the quality of the AlGaAs films, especially at low V/III ratios. The
TMAA-AsH3 reaction produces a compositional dependence on the gas phase
stoichiometry (V/III ratio). High quality AlGaAs with excellent thickness and
compositional uniformity was produced by spatially separating the TMAA and
TEG in the gas phase, which minimizes the parasitic reactions. Applying flow
modulation techniques18,19 dramatically reduces the arsine flow requirements for
producing acceptable quality AlGaAs.

II.  Experimental 

AlGaAs layers were grown on (100) Si-doped n+ GaAs and (100) semi-insulating
GaAs substrates in a vertical barrel, multichamber OMVPE system20.
The reaction cell is made of 6-inch high purity quartz. The graphite susceptor,
which can hold up to eighteen 1.5-inch wafers, is inductively heated by RF radiation.
Each organometallic line has independent pressure control to enhance transport
to the reaction cell. The system is also equipped with an in-situ quadrupole mass
analyzer to detect gas leaks before experiments. Figure 1a illustrates the gas flow
in the reaction chamber. The reaction chamber has two growth zones that are
spatially separated by large hydrogen fluxes. The substrates are rotated through
the growth zones without valve switching. An inner quartz ampoule (diameter d)
separates the reactant fluxes. Arsine is uniformly injected into the entire growth
chamber. During the group III exposure cycle the local V/III ratio is estimated
to be 25% of the average value. The V/III ratio quoted throughout represents
the average V/III ratio determined by the total injected reactant fluxes. The flow
modulation exposure cycle for each growth mode is shown in Figure 1b. The
group III reactants are modulated while the AsH3 exposure remains constant.
In the conventional growth mode, the TEG and TMAA are premixed prior to
injection into the reaction cell while the susceptor is rotated at 0.1 rev/sec. In the
spatially separated growth mode, the TEG and TMAA are injected into separate
growth zones, minimizing the TEG-TMAA gas phase reactions. For the group
III flux used in this study, susceptor rotation speeds greater than 1 rev/sec are
needed to produce sub-monolayer exposure cycles, which result in mixed alloys.
Rotation speeds ranging from 0.1 to 0.7 rev/sec were used when the reactant
fluxes were separated in the vapor. Raman spectroscopy confirmed the existence
of short period superlattices (confined LO GaAs and AlAs vibrations) on all
samples produced with this method. The degree of deposition zone separation
(indicated by the set of arrows in Figure 1b) is proportional to the amount of
hydrogen carrier gas injected between each zone. Because a small amount of zone
intermixing occurs in the spatially separated growth mode (see Figure 1b), the
short period superlattices have graded interfaces. The total gas flow was 30 slm,
while the gas velocity was maintained at 30 cm/s.
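
The quoted total flow and gas velocity can be related through the cell pressure and the open flow cross-section. The following sketch is a rough consistency check, not a description of the actual reactor geometry: it assumes "slm" refers to 760 torr and 273.15 K, ideal-gas behavior, an unheated gas stream by default, and the 76 torr growth pressure reported for this system. The function names are hypothetical.

```python
# Rough consistency check between total flow (slm), pressure and mean gas velocity.
# ASSUMPTIONS: "standard" conditions are 760 torr and 273.15 K; ideal gas;
# gas temperature defaults to the standard temperature (no heating correction).

STANDARD_PRESSURE_TORR = 760.0
STANDARD_TEMPERATURE_K = 273.15


def actual_volumetric_flow_cm3_s(flow_slm, pressure_torr,
                                 gas_temp_k=STANDARD_TEMPERATURE_K):
    """Convert a standard flow in slm to the actual volumetric flow in cm^3/s."""
    standard_cm3_s = flow_slm * 1000.0 / 60.0
    return (standard_cm3_s
            * (STANDARD_PRESSURE_TORR / pressure_torr)
            * (gas_temp_k / STANDARD_TEMPERATURE_K))


def mean_velocity_cm_s(flow_slm, pressure_torr, flow_area_cm2,
                       gas_temp_k=STANDARD_TEMPERATURE_K):
    """Mean gas velocity through an open cross-section of the given area."""
    return actual_volumetric_flow_cm3_s(flow_slm, pressure_torr, gas_temp_k) / flow_area_cm2


def flow_area_for_velocity_cm2(flow_slm, pressure_torr, velocity_cm_s,
                               gas_temp_k=STANDARD_TEMPERATURE_K):
    """Cross-section implied by a given flow, pressure and mean velocity."""
    return actual_volumetric_flow_cm3_s(flow_slm, pressure_torr, gas_temp_k) / velocity_cm_s


if __name__ == "__main__":
    area = flow_area_for_velocity_cm2(30.0, 76.0, 30.0)
    print(f"Implied open flow area: {area:.0f} cm^2")
```

Under these assumptions, the implied cross-section is of the same order as that of a 6-inch diameter cell, which suggests the quoted flow and velocity are mutually consistent.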

The sources used were TMAA, adduct-purified TEG23 and 100% Phoenix
Research Grade AsH3. Arsine was passed through an Al-Ga-In melt to reduce the
oxygen and H2O contamination1. Palladium diffused H2 was used as a carrier
gas. The growth pressure was 76 torr. The TMAA was held at 23 °C (vapor
pressure ~2 torr) while a H2 flow of 57 sccm was passed through the bubbler.
The TEG was also held at 23 °C (vapor pressure ~5 torr), while the flow varied
from 18 to 50 sccm. Both TEG and TMAA were maintained at 100 torr. The
substrates were first degreased in organic solvents, then etched for 10 minutes in
5H2SO4:1H2O2:1H2O prior to growth. The growth temperature varied from 635
to 750 °C and the V/III ratio was varied from 1 to 80. Layer thicknesses ranged
from 0.5 to 2 µm.
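
The source conditions above are enough for an order-of-magnitude estimate of the group III molar flows and of the arsine flow implied by a given V/III ratio. The sketch below uses the standard saturated-bubbler relation, which is an assumption here: it presumes the carrier gas leaves fully saturated, which may not hold well for a solid source such as TMAA, and it reads the quoted 100 torr as the bubbler pressure. All function names are hypothetical.

```python
# Saturated-bubbler estimate of source pickup, in sccm of source vapor.
# ASSUMPTIONS: carrier gas fully saturated; ideal gas; the 100 torr quoted for
# TEG and TMAA is read as the total pressure inside each bubbler.

LOCAL_V_III_FRACTION = 0.25  # local V/III during group III exposure, per the text


def bubbler_pickup_sccm(carrier_sccm, vapor_pressure_torr, bubbler_pressure_torr):
    """Source vapor flow carried out of a bubbler by the carrier gas."""
    if vapor_pressure_torr >= bubbler_pressure_torr:
        raise ValueError("vapor pressure must be below the bubbler pressure")
    return carrier_sccm * vapor_pressure_torr / (bubbler_pressure_torr - vapor_pressure_torr)


def arsine_flow_sccm(average_v_iii_ratio, group_iii_sccm):
    """Arsine flow needed for a given average V/III ratio."""
    return average_v_iii_ratio * group_iii_sccm


if __name__ == "__main__":
    tmaa = bubbler_pickup_sccm(57.0, 2.0, 100.0)      # TMAA at 23 C, ~2 torr
    teg_low = bubbler_pickup_sccm(18.0, 5.0, 100.0)   # TEG at 23 C, ~5 torr
    teg_high = bubbler_pickup_sccm(50.0, 5.0, 100.0)
    print(f"TMAA: {tmaa:.2f} sccm, TEG: {teg_low:.2f} to {teg_high:.2f} sccm")
    for ratio in (1, 80):
        total_iii = tmaa + teg_low
        print(f"V/III = {ratio}: AsH3 ~ {arsine_flow_sccm(ratio, total_iii):.1f} sccm, "
              f"local V/III ~ {ratio * LOCAL_V_III_FRACTION:.2f}")
```

Because each TMAA molecule carries one Al atom and each TEG molecule one Ga atom, the vapor flows can be summed directly to obtain the group III flux in this estimate.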

Films were characterized by Hall measurements, Raman spectroscopy, photoluminescence
(PL) and double crystal X-ray diffractometry. Raman spectroscopy
was used to determine the aluminum composition22 and the structure
features of the superlattices23. Optical quality was assessed using low (1 K) and
room temperature photoluminescence (PL). Low temperature PL was carried out
with samples submerged in superfluid He with photoexcitation provided by the
514.5 nm line of an Ar+ laser. Thickness measurements were made by a combination
of angle bevelling and staining and from analysis of reflectance spectra.
A double crystal X-ray diffractometer with a computer controlled X-Y stage was
used to determine the layer composition24 and map the compositional uniformity
across the wafer. The Cu Kα1 X-ray beam was monochromatized by (111)
reflections from a perfect Si crystal.

III.  Results  and  Discussion 

All layers were n-type, and background carrier concentrations were in the 10^15 cm^-3
range. This is believed to be due to Si impurities in TMAA5. The V/III ratio
and growth temperature criteria for good surface morphology were investigated
over the temperature range 635 to 750 °C for AlxGa1-xAs (x < 0.2). When the
substrate temperature is below 675 °C, more AsH3 must be supplied to maintain
specular surfaces. As the growth temperature is increased beyond 675 °C, the
V/III ratio at the good morphology/bad morphology transition approaches a
constant value of unity. An analogous study has been reported using TEG and
AsH3 (Ref. 25).

With growth temperature (670 °C) and group III flux constant, the Al mole
fraction as determined by X-ray diffraction and Raman scattering was found to
vary with V/III ratio in the conventional premixed growth mode. As shown in
Figure 2, more Al is incorporated in the film as the V/III ratio is decreased. For
low TMAA fluxes, corresponding to alloy compositions less than 20%, AsH3 appears
to prevent the TEG-TMAA reaction, which is shown to reduce the TEG
transport to the growth surface. As the PL spectra in Figure 3 reveal, the
sample quality degrades with decreasing V/III ratio. At a V/III ratio of 80,
a full width at half maximum exciton linewidth of 2.2 meV was observed for
Al0.15Ga0.85As. This compares favorably to the narrowest linewidth ever produced
using OMVPE26. As the V/III ratio decreased to 50, the exciton linewidth
was 6.5 meV. The linewidth continued to broaden, but the exciton was still clearly
identifiable when the V/III ratio was lowered to 7.5. Finally, at a V/III ratio of 1,
the exciton feature was absent, indicating that even though morphology was good for
these growth conditions, material purity was relatively poor. The need for large
arsine flows may suggest that the TMAA-AsH3 reaction inhibits the TMAA-TEG
reaction, which is demonstrated to severely degrade the quality of the AlGaAs
films.

Gas phase reactions between TEG and TMAA have major effects on the
growth rate. As Figure 4 illustrates, when the TMAA and TEG are premixed,
the AlGaAs growth rate is approximately half that of GaAs with the same TEG
reactant flux. A relatively high V/III ratio was used (V/III = 40) to eliminate the
effects of AsH3 flows described earlier. The effect of growth rate reduction was
also reported for CBE for premixed TMAA and TEG13. The Al composition for
AlGaAs grown using premixed sources was 79%, whereas that for the spatially
separated sources was nominally 40%. Assuming that the Al and Ga incorporation
in the AlGaAs layer in the two growth modes are equal, an estimated
70% of the TMAA reacts in the gas phase to produce nonvolatile compounds.
The reduction of gallium incorporation seems to be consistent with FTIR results
on a TMAA/TMG gas mixture15. In addition, color fringes were observed downstream
along the wafer, indicating severe thickness nonuniformity (±16% over a
20 mm diameter). In contrast, excellent thickness uniformity was realized (±1%
over a 20 mm diameter) when the TEG and TMAA were separated. Although the
growth rate was roughly doubled by separating the group III reactant fluxes, it
was still lower than that for GaAs, likely due to the partial intermixing of the
growth zones.

Figures 6 and 7 are compositional uniformity maps for the premixed and
spatially separated growth modes, respectively. In both cases, the aluminum
concentration decreases downstream along the wafer. The compositional uniformity is
approximately the same for both growth modes (~±2% over 40 mm2). This is
consistent with other reports using TMAA and TEG5.

The X-ray rocking curves for the premixed and spatially separated growth
modes are compared in Figure 8. The premixed grown material exhibited a broad
peak from the epitaxial layer, indicative of poor structural quality. In addition,
a long tail on the substrate peak is present, probably due to compositional grading
in the layer. In contrast, layers produced by spatially separating the TMAA
and TEG had a peak that was comparable to that of the substrate. Both curves
exhibited extra peaks, which are possibly due to compositional grading. This is
believed to be due to erratic transport of the TMAA, which commonly occurs with
solid organometallic sources such as trimethylindium27.

PL spectra were compared between layers grown by premixed and spatially
separated growth modes for constant reactant flux. These experiments were performed
at a growth temperature of 670 °C and a V/III ratio of 80. As shown in
Figure 9, the material grown with spatially separated group III fluxes exhibited
much stronger PL intensity than the premixed grown material. Material produced
by spatially separating the TMAA and TEG exhibited strong room temperature
PL, which was difficult to observe when premixed sources were used. An
explanation for this effect is that in addition to the TMAA and TEG forming
nonvolatile compounds, volatile compounds are also present which participate
in the growth process and reduce the radiative efficiency in the AlGaAs.

IV. Summary

We have described the effects of gas phase reactions between TMAA, TEG
and AsH3 on AlGaAs films grown by OMVPE. The TMAA-AsH3 reaction produces a
compositional dependence on the gas phase stoichiometry (V/III ratio). The
TMAA-TEG reaction results in the formation of nonvolatile compounds which
reduce the growth rate and degrade the deposition uniformity. Poor luminescence
was observed, suggesting the presence of volatile compounds which
produce non-radiative centers. These effects were dramatically reduced by spatially
separating the reactants. The arsine flow requirement for yielding AlGaAs
with good surface morphology using TMAA, TEG and AsH3 has been identified.
Finally, we have demonstrated that under the appropriate growth conditions,
acceptable quality AlGaAs can be produced using low V/III ratios.

FIGURE CAPTIONS

1. Figure 1: (a) Schematic illustration of implementation of Flow Modulation
Epitaxy in the multichamber cell. Two TEG sources, one on each growth
zone, allow for conventional premixed injection or spatially separated group
III sources. The arsine is uniformly injected around the cell. The inner quartz
ampoule (diameter d) serves to separate the reactant fluxes of each deposition
zone. (b) The exposure cycle for premixed and spatially separated TMAA
and TEG. The arsine flow is uniformly distributed around the cell. Dotted
lines represent the reactant flux zero reference. The degree of deposition zone
separation is indicated schematically by the set of arrows in the lower diagram of
the figure.

2. Figure 2: Dependence of the aluminum composition (determined from Raman
scattering) on V/III ratio for constant TEG and TMAA fluxes at 670 °C.

3. Figure 3: Low temperature PL spectra for various V/III ratios at 670 °C and
constant TEG and TMAA fluxes. Compositional variation is due to the TMAA-arsine
gas phase reaction. The luminescence intensity is magnified by the
factors shown. Excitation conditions are as indicated.

4.  Figure  4:  The  growth  rate  of  undoped  AlGaAs  downstream  along  the  wafer 
when  TMAA  and  TEG  are  premixed  prior  to  injection  into  the  growth  chamber 
and  spatially  separated  in  the  gas  phase.  The  nominal  aluminum  composition 
of the superlattice is 0.40. The growth rate is normalized to GaAs. The experimental
conditions are as indicated.

5. Figure 5: Compositional uniformity across a 20 × 20 mm wafer for AlGaAs
grown by premixing the TMAA and TEG at 670 °C and a V/III ratio of 40. The map
was constructed from X-ray diffraction close to the (004) reflection.

6. Figure 6: Compositional uniformity across a 12 × 16 mm wafer for AlGaAs
grown by spatially separated TMAA and TEG at 670 °C and a V/III ratio of 40. The
map was constructed from X-ray diffraction close to the (004) reflection.

7. Figure 7: X-ray rocking curves of AlGaAs grown with the TMAA and TEG either
premixed prior to their injection into the reaction cell or spatially separated in the
multichamber cell, as indicated. The nominal Al compositions of the layers grown by
premixed and spatially separated growth modes were 70% and 26%, respectively. Both
layers were grown at 670 °C and a V/III ratio of 40. The X-ray diffraction was taken
close to the (004) reflection.

8. Figure 8: Low temperature (1 K) photoluminescence of undoped AlGaAs
grown with TEG, TMAA and AsH3 using FME. The TMAA and TEG were
either premixed prior to their injection into the reaction cell or spatially separated
in the multichamber cell, as indicated. The luminescence is magnified by
the factors shown. Excitation conditions, growth conditions and superlattice
periods are as indicated.

We report the generation of high-repetition-rate femtosecond pulses in the blue by intracavity doubling of a
mode-locked Ti:sapphire laser using β-BaB2O4. To reduce the pulse-broadening effect of group-velocity mismatch,
an extremely thin β-BaB2O4 crystal is used. By pumping the Ti:sapphire laser with 4.4 W of power
from an Ar+ laser, as much as 230 mW of 430-nm light is produced at a 72-MHz repetition rate and an 89-fs
pulse width. This represents an effective conversion efficiency of ~75% from the typical infrared output to the
second harmonic. Pulse widths as short as 54 fs are achieved for the blue output.


Extension  of  the  wavelength  range  accessible  to 
femtosecond  pulses  has  been  a  topic  of  much  inter¬ 
est.  The  two  techniques  used  most  frequently  to 
generate  <100-fs  pulses  at  otherwise  unattainable 
wavelengths  are  continuum  generation  and  fre¬ 
quency  conversion  with  the  use  of  crystals.  Fem¬ 
tosecond  pulse  generation  techniques  based  on 
amplification  followed  by  continuum  generation  per¬ 
mit  tunability  from  the  UV  into  the  IR.1  However, 
amplification  reduces  the  pulse  repetition  rate  to 
the  order  of  a  kilohertz,  and  there  is  often  a  loss  of 
time  resolution  in  the  final  pulse.  In  contrast,  fre¬ 
quency  conversion  in  crystals  can  maintain  the  high 
repetition  rate  of  the  femtosecond  megahertz-rate 
laser  and  requires  only  a  single  cw  pump  laser.  The 
higher  repetition  rate  results  in  much  smaller  pulse 
fluctuation  and  excellent  experimental  signal-to- 
noise  ratios. 

In  recent  years,  much  progress  has  been  made  in 
extending  the  spectral  range  of  high-repetition-rate 
femtosecond  pulses  throughout  the  visible  and  IR 
by  using  frequency  conversion  in  crystals.  The 
80-MHz  femtosecond  optical  parametric  oscillator 
permits  broad  tunability  throughout  the  near  IR 
and  mid-IR.2,3  High-repetition-rate  femtosecond 
pulse  generation  in  the  UV  and  blue-green  has  been 
somewhat  more  limited.  Colliding-pulse  mode- 
locked (CPM) lasers have directly generated
<100-fs pulses in the range of 493 to 554 nm at
milliwatt outputs,4,5 and intracavity doubling of
the Rhodamine 6G/diethyloxadicarbocyanine iodide
(Rh6G/DODCI) CPM dye laser has resulted in a
100-MHz source of femtosecond pulses with milliwatt
outputs in the 310-315-nm range. The
Rh6G/DODCI CPM laser was first intracavity
doubled by using KDP.8 Soon thereafter, β-BaB2O4
(BBO)  was  used  to  intracavity  double  the  CPM  laser 
with  a  per-pass  conversion  efficiency  as  high  as 
5.5%,  which  generated  20  mW  of  UV  output  per  arm 
with  <  100-fs  pulse  widths,  and  pulse  widths  as  short 
as  43  fs.‘  This  gives  an  effective  conversion  effi¬ 


ciency  of  nearly  100%  from  the  typical  CPM  output 
in the red to the UV.

While  the  standard  Rh6G/DODCI  CPM  dye  laser 
operates  at  a  wavelength  slightly  shorter  than  the 
tuning  range  of  the  Ti:sapphire  laser,  the  broad  tun¬ 
ability,  the  high  average  output  power,  and  the  obvi¬ 
ous  advantages  of  a  solid-state  laser  have  made  the 
dispersion-compensated mode-locked Ti:sapphire
laser8  an  extremely  attractive  replacement  for 
the  CPM  dye  laser.  At  present,  the  mode-locked 
Ti:sapphire  laser  can  potentially  operate  with 
<200-fs  pulse  widths  and  >100-mW  average  power 
over  the  range  of  700  to  1053  nm.9  Frequency  dou¬ 
bling  over  this  spectral  range  provides  femtosecond 
pulses  from  350  to  525  nm.  Doubling  of  the 
Ti:sapphire laser outside the cavity has been reported.10
The best conversion efficiency of 25%
was  achieved  at  750  nm,  although  no  second- 
harmonic  pulse  widths  were  reported  and  the  length 
of  the  doubling  crystal  was  not  given.  The  group- 
velocity  mismatch  for  type  I  second-harmonic  gen¬ 
eration  (SHG)  in  BBO  at  750  nm  is  225  fs/mm,  and 
in  order  to  maintain  the  narrowest  temporal  pulse 
width  a  thin  doubling  crystal  is  required.  Use  of  a 
thin  crystal  therefore  necessitates  a  high  peak 
power  to  achieve  high  conversion  efficiency,  and 
thus  intracavity  doubling  is  required  to  achieve 
simultaneously  the  shortest  pulses  and  the  highest 
power  in  the  second  harmonic.  As  discussed  fur¬ 
ther  below,  extremely  high  intracavity  conversion  ef¬ 
ficiency  is  possible,  which  would  result  in  UV,  blue, 
or  green  outputs  of  hundreds  of  milliwatts  average 
power. Using an extremely thin (55 µm) crystal of
BBO,  we  demonstrate  a  72-MHz  repetition-rate 
source  of  blue  pulses  of  89-fs  duration  (FWHM)  and 
115  mW  of  power  per  arm  (two  arms  of  BBO;  see 
Fig.  1).  Reducing  the  pulse  width  for  the  blue  to 
54  fs,  we  measure  45  mW  of  power  per  arm. 
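
The thin-crystal argument above can be checked with a few lines of arithmetic. The sketch below assumes the simplest possible model, in which the temporal walk-off between the fundamental and second-harmonic pulses is the product of the group-velocity mismatch and the crystal length. The function name and the 1-mm comparison crystal are illustrative only, and the 225 fs/mm figure is the value quoted for 750 nm, so it overstates the walk-off at the longer operating wavelength.

```python
# Rough estimate of temporal walk-off in a doubling crystal.
# Assumption: walk-off = group-velocity mismatch x crystal length.

def walkoff_fs(gvm_fs_per_mm: float, length_mm: float) -> float:
    """Temporal walk-off (fs) accumulated over the crystal length."""
    return gvm_fs_per_mm * length_mm

GVM_750NM = 225.0  # fs/mm, type I SHG in BBO at 750 nm (value from the text)

thin = walkoff_fs(GVM_750NM, 0.055)  # the 55-um crystal described in the text
thick = walkoff_fs(GVM_750NM, 1.0)   # hypothetical 1-mm crystal, for comparison

print(f"55-um crystal: {thin:.1f} fs")
print(f"1-mm crystal:  {thick:.1f} fs")
```

Under this assumption the 55-µm crystal contributes a walk-off of roughly 12 fs, small compared with the 54- to 89-fs pulse widths reported, whereas a millimeter-scale crystal would contribute more than the pulse width itself.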

Fig. 1. Schematic of the intracavity-doubled Ti:sapphire
laser. XTAL, Ti:sapphire crystal; G's, gain mirrors; L,
focusing lens; P's, SF-10 prisms; M's, flat mirrors; D,
dichroic mirror; BBO, doubling crystal; S, tuning slit; OC,
output coupler.

Figure 1 shows a schematic of the dispersion-compensated
intracavity-doubled Ti:sapphire laser.
The SF-10 prisms are spaced 50 cm tip to tip. The
argon pump laser is focused by a 10-cm focal-length
lens through one of the r = 10 cm gain mirrors onto
the 18-mm-long titanium-doped (0.1%) sapphire
crystal. The additional intracavity focus at the
BBO crystal is formed by r = 5 cm dichroic mirrors
(fused-silica substrates, R = 100% at 860 nm,
T = 70% at 430 nm). The outcoupler has T = 1%
for the IR and was replaced by a high reflector when
the highest power in the blue was generated. Before
insertion into the laser cavity, the crystal is
aligned for maximum SHG conversion efficiency in
the extracavity beam of the mode-locked Ti:sapphire
laser operating at the intended doubling wavelength
of ~860 nm. The proper alignment of the BBO can
be preserved on insertion into the laser cavity.

Pulse-width  measurements  for  both  the  funda¬ 
mental  (IR)  and  the  second-harmonic  light  are  made 
by  autocorrelation  with  collinear  type  I  SHG  in 
BBO.  The  BBO  crystal  used  to  measure  the  IR 
autocorrelation  has  a  thickness  of  0.8  mm  and  is  cut 
for a phase-matching angle of θ = 27.5°. The BBO
crystal used to measure the blue pulse widths has a
thickness of 0.67 mm and is cut at θ = 69°. The
second harmonic of the blue (215 nm, the fourth
harmonic of the Ti:sapphire) is passed through a
0.2-m monochromator and detected by a solar-blind
photomultiplier tube. The spectra for the fundamental
and second-harmonic outputs from the laser
are measured by using a 0.25-m monochromator to
disperse  the  light  onto  an  optical  multichannel 
analyzer. 

We  point  out  that  the  type  I  SHG  cutoff  wave¬ 
length  in  the  blue  for  BBO  is  409  nm.  Below  this 
wavelength,  accurate  pulse-width  measurement  re¬ 
quires  a  more  difficult  technique  such  as  cross 
correlating  the  fundamental  beam  with  the  second- 
harmonic  beam  by  using  phase-matched  sum- 
frequency  generation.  Owing  to  the  significant 
group-velocity  mismatch  between  the  fundamental 
and  second-harmonic  pulses  for  fundamental  wave¬ 
lengths  below  820  nm  (the  group-velocity  mismatch 
is >170 fs/mm for BBO at λ_IR = 820 nm and
increases for shorter wavelengths), a thin cross-correlation
crystal is required.7 Thus, for the
convenience of using collinear type I SHG autocorrelation
to measure the pulse width of the doubled
light, we operated the Ti:sapphire laser at
λ > 820 nm.


The intracavity-doubled mode-locked laser is
started  by  a  slight  mechanical  perturbation,  usually 
by  a  small-amplitude,  gentle  back-and-forth  transla¬ 
tion  of  one  prism.  Once  well  aligned,  the  mode- 
locked  laser  operates  stably  indefinitely  (observed 
for as much as ~6 h), although significant mechanical
perturbation can stop mode-locked operation.
The  mode  locking  generally  is  not  self-starting. 
Variation  of  the  intracavity  dispersion  compensation 
permits  control  of  the  pulse  width.  On  starting,  the 
laser  is  pushed  to  shorter  pulses  simply  by  adding 
prism  glass  and  adjusting  the  focusing  slightly  to 
maintain  high  stability.  While  the  laser  stability  is 
excellent  even  at  the  longer  pulse  widths,  the  oscillo¬ 
scope  trace  of  the  IR  mode-locked  pulse  train  indi¬ 
cates  somewhat  quieter  operation  as  the  pulse  width 
is  decreased.  The  spatial  mode  of  the  fundamental 
beam is TEM00 with faint, simple higher-order
modes superimposed. The blue beam mode is a
clean TEM00 that shows no sign of higher-order
modes, thus verifying that the power of the fundamental
lies almost entirely in the TEM00 mode.

Fig. 2. (a) Autocorrelation data for the fundamental and
second-harmonic pulses in the longer-pulse limit. The
FWHM for the fundamental is 107 fs, and for the second
harmonic it is 89 fs. (b) Spectra for the fundamental and
second-harmonic beams. The FWHM for the fundamental
is 12.7 nm, which gives ΔνΔt = 0.55, and for the second
harmonic it is 4.9 nm, which gives ΔνΔt = 0.71.

Fig. 3. (a) Autocorrelation data for the fundamental and
second-harmonic pulses for the shortest second-harmonic
pulses. The FWHM for the fundamental is 93 fs, and for
the second harmonic it is 54 fs. (b) Spectra for the fundamental
and second-harmonic beams. The FWHM for
the fundamental is 18.6 nm, and for the second harmonic
it is 7.7 nm. This gives ΔνΔt = 0.70 for the fundamental
and ΔνΔt = 0.67 for the blue second-harmonic pulses.

When the laser is run with a high reflector in
place of the outcoupler, 107-fs IR pulses produce
230 mW of second harmonic. Without the intracavity
doubling crystal, the maximum output of the
mode-locked Ti:sapphire laser operating at 860 nm
is ~300 mW for 4.4-W pump power; thus generation
of 230 mW of blue power gives an effective conversion
efficiency of ~75% from the IR output typical
at this pump power. The dichroic mirrors transmit
~72 mW of power per arm of the blue second-harmonic
light. On compression of the blue pulses
by a dispersion-compensating prism pair, a pulse
width of 89 fs is measured (see Fig. 2). The prism
pair allows compensation for the dispersion of the
dichroic mirror substrate and of other intracavity
optics as well as for any upchirp that the pulses may
have on generation in the intracavity BBO crystal.
The IR pulses are not extracavity dispersion compensated.
The spectral FWHM's of the IR and blue
are 12.7 and 4.9 nm, respectively, which give ΔνΔt =
0.55 for the IR and ΔνΔt = 0.71 for the blue pulses.
Pulse widths and time-bandwidth products are determined
assuming a sech²(t) intensity envelope.

We achieved the shortest blue pulses when running
the laser with a 1% outcoupler in place of the
high reflector and operating closer to net zero intracavity
group-velocity dispersion (see Fig. 3). The
power of the IR coupled out is 27 mW, whereas the
blue power transmitted by the dichroic mirrors
is ~31 mW per arm. The extracavity dispersion-
compensated blue pulses have a FWHM of 54 fs and
a spectral FWHM of 7.7 nm, which gives ΔνΔt =
0.67. The IR pulses (which again are not extracavity
dispersion compensated) have a pulse width of 93 fs
and a spectral FWHM of 18.6 nm, which yields
ΔνΔt = 0.70. It is believed that the IR pulses may
be compressed by an extracavity two-prism sequence,
and we hope to verify this in the near future.
Again, a sech²(t) intensity envelope is assumed.
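
The quoted time-bandwidth products can be reproduced from the measured pulse widths and spectral widths. The sketch below is a minimal check, not the authors' procedure: it assumes the narrow-band conversion Δν = cΔλ/λ², and it assumes center wavelengths of 860 nm for the IR and 430 nm for the blue in both operating regimes (the text states these only for the intended doubling wavelength). The function and dictionary names are illustrative.

```python
# Time-bandwidth product from a spectral FWHM given in wavelength units.
# Assumptions: delta_nu = c * delta_lambda / lambda**2 (narrow-band limit),
# and center wavelengths of 860 nm (IR) and 430 nm (blue).

C = 299_792_458.0  # speed of light, m/s

def time_bandwidth_product(pulse_fs, bandwidth_nm, center_nm):
    """Return delta_nu * delta_t (dimensionless) for FWHM inputs."""
    delta_nu = C * (bandwidth_nm * 1e-9) / (center_nm * 1e-9) ** 2  # Hz
    return delta_nu * pulse_fs * 1e-15

# (pulse width in fs, spectral FWHM in nm, assumed center wavelength in nm)
cases = {
    "IR, longer-pulse limit": (107, 12.7, 860),
    "blue, longer-pulse limit": (89, 4.9, 430),
    "IR, shortest pulses": (93, 18.6, 860),
    "blue, shortest pulses": (54, 7.7, 430),
}

results = {name: time_bandwidth_product(*args) for name, args in cases.items()}
for name, tbp in results.items():
    print(f"{name}: {tbp:.2f}")
```

Under these assumptions the four products round to 0.55, 0.71, 0.70, and 0.67, matching the values reported for Figs. 2 and 3.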

The observed intracavity SHG conversion efficiency
of 3.2% per pass for the shortest blue pulses
agrees well with the theory (3.5%) for conversion by
a nondepleted pump wave.11 Without the intracavity
BBO crystal, we have observed stable mode-
locked operation for <100-fs pulses at intracavity
powers as high as 8 W. For the same focusing and
BBO crystal length presented here, 8 W of intracavity
power at a 110-fs pulse width would yield a
more than fourfold increase in the output of the second
harmonic, or ~500 mW of blue light. For this
case, the peak intracavity intensity at the focus
would approach the reported single-shot damage
threshold for BBO of 50 GW/cm². However, this
threshold pertains to pulses of 8-ns duration, and we
expect the threshold to increase by orders of magnitude
for the 100-fs pulse-width regime. The average
intensity is orders of magnitude below the
long-term damage threshold for BBO.12

In  conclusion,  we  have  demonstrated  highly  effi¬ 
cient  intracavity  doubling  of  a  mode-locked 
Ti:sapphire laser that yields a source of femtosecond
pulses  in  the  blue  with  the  same  high  repetition  rate 
of  72  MHz,  short  pulse  width,  excellent  beam  qual¬ 
ity,  and  power  in  the  blue  representing  appreciable 
recovery  of  the  typical  IR  output  at  this  4.4-W  pump 
level.  This  research  represents  an  extension  of  in¬ 
tracavity  doubling  to  solid-state  mode-locked  lasers 
and  results  in  a  source  of  femtosecond  pulses  poten¬ 
tially  tunable  from  the  near  UV  into  the  green,  thus 
broadly  expanding  the  potential  spectral  range  for 
femtosecond  pulses. 

The authors thank W. S. Pelouch, P. E. Powers, and
D.  C.  Edelstein  for  helpful  conversations.  This 
research  was  supported  by  the  Joint  Services 
Electronics Program and the National Science

A broadly tunable femtosecond optical parametric oscillator (OPO) based on KTiOPO4 that is externally
pumped by a self-mode-locked Ti:sapphire laser is described. Continuous tuning is demonstrated from 1.22 to
1.37 µm in the signal branch and from 1.82 to 2.15 µm in the idler branch by using one set of OPO optics. The
potential tuning range of the OPO is from 1.0 to 2.75 µm and requires three sets of mirrors and two crystals.
Without prisms in the OPO cavity, 340 mW (475 mW) of chirped-pulse power is generated in the signal (idler)
branch for 2.5 W of pump power. The total conversion efficiency as measured by the pump depletion is 55%.
With prisms in the cavity, pulses of 135 fs are generated, which can be shortened to 75 fs by increasing the
output coupling.


Optical parametric oscillators (OPO's) have recently
been  exploited  in  the  femtosecond  time  domain  as  a 
source  of  broadly  and  continuously  tunable  radia¬ 
tion.  The  lack  of  suitable  pump  sources  has  ham¬ 
pered  the  development  of  femtosecond  OPO's  that 
operate  with  short  pulse  widths,  a  high  repetition 
rate,  and  high  output  powers.  The  high  peak  power 
at the intracavity focus of a colliding-pulse mode-
locked dye laser was exploited to develop the first
femtosecond OPO.1,2 This resulted in ≥105-fs,
80-MHz pulses at approximately 3 mW of output
power. Other researchers resorted to a Q-switched
and mode-locked laser (300 pulses at 15 Hz) to pump
an OPO producing >160-fs pulses (65 fs at one wavelength)
at 4.5 mW of average power.* More recently
a  femtosecond  OPO  was  reported  that  was  exter¬ 
nally  pumped  by  a  hybridly  mode-locked  dye  laser 
producing  220-fs  pulses  at  30  mW  of  average  power.5 
In this Letter we describe a Ti:sapphire-pumped
OPO  capable  of  producing  75-fs  pulses  at  a  high 
repetition  rate  (90  MHz)  and  hundreds  of  milliwatts 
of  average  output  power.  We  believe  that  these  are 
the  shortest  tunable  pulses  ever  generated  from  an 
OPO. 


The Ti:sapphire pump laser is configured in a
linear cavity with an 18-mm titanium-doped (0.1%)
sapphire crystal and SF-14 prisms (spaced at 40 cm)
for dispersion compensation. The crystal is
mounted in a copper block and cooled by using a
thermoelectric cooler with temperature feedback to
maintain a constant 20°C temperature. The laser
is [illegible] as described elsewhere in the literature
and produces 2.5 W of 125-fs pulses in a
[illegible] when pumped by a 15-W argon-ion
laser. A schematic of the OPO cavity is shown in
Fig. 1. The Ti:sapphire laser beam is focused onto
a 1.15-mm KTP crystal with polarization along the
y axis using an r = [illegible] cm curved high reflector. The
pump suffers approximately a 5% transmission loss
for each side of the crystal. The KTP crystal is cut
at θ = 47.5° and φ = 0° for type II phase matching
(o → e + o) and coated with a 250-nm layer of MgF2
on both sides for high transmission centered at
1.3 µm. The OPO cavity uses two r = 10 cm curved
mirrors that are aligned for oscillation in the x-z
plane of the crystal to provide compensation for
walk-off between the Poynting vectors of the pump
and the resonated signal branch.1 The cavity may
be aligned with or without the SF-14 prism sequence
simply by lowering or raising the prism assembly.
The output coupler is 1%, and the other flat mirror
is mounted on a piezoelectric transducer for fine
length adjustment. A linear cavity design was chosen
so that the pump can be retroreflected for
double-pass pumping of the KTP crystal.* This
would result in parametric gain for the signal in both
directions through the crystal when the retroreflected
pump pulses overlap the signal pulses in the
crystal. However, this requires that an optical isolator
be inserted between the pump laser and the
OPO to reject feedback into the Ti:sapphire cavity.

The  OPO  is  aligned  by  monitoring  the  spontaneous 
parametric  scattering  using  a  liquid-nitrogen-cooled 
germanium photodiode [the peak detectivity is
~10^13 cm Hz^(1/2) W^-1 at 1.5 µm]. This signal is maximized
by adjusting the OPO mirrors and focusing
such that the spontaneous parametric scattering
makes many round trips in the cavity. Oscillation
occurs  when  the  cavity  length  of  the  OPO  is  matched 
to  that  of  the  pump  laser  cavity;  the  length  mis¬ 
match  becomes  more  sensitive  near  threshold. 

Fig. 1. Schematic of the OPO cavity in the vertical plane.
The Ti:sapphire pump (P) is focused onto the 1.15-mm
KTP crystal. An enlarged view of the crystal is depicted
above and shows the orientation for type II phase matching
at the phase-matching angle θpm. The signal branch
(S) is resonated by using a 1% output coupler and a
piezoelectric transducer (PZT) for fine length adjustment.
The idler (I) exits from the crystal at ~6 deg from the
signal. The prism sequence may be raised to allow oscillation
without the prisms.

With 2.5 W of pump power (125 fs) the OPO produces
as much as 340 mW of power in the signal
branch through the 1% output coupler. We have measured
60 mW of signal power reflected from the
KTP crystal in one direction (120-mW loss per round
trip), which implies a transmission loss of 0.2%.
Thus 460 mW of power is generated in the signal
branch with an effective output coupler of 1.4%. In
the idler branch we have coupled out 475 mW of
power, but this may be limited by the physical constraints
of collecting and collimating the diverging
idler radiation that is generated at ~6 deg (external
to the crystal) from the signal. The pump is depleted
by 55% when the OPO is oscillating, and this depletion is a
measure of the actual conversion efficiency; this
value agrees well with the measured power output of
the OPO if the crystal reflections and the pump
transmission losses are taken into account. Double-
pass pumping has not yet been implemented in the
OPO since excellent conversion efficiency has already
been achieved. If only one pass of the pump
were used, then a ring cavity would provide less loss
than the linear cavity.
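
The power bookkeeping for the signal branch can be laid out explicitly. The sketch below is an illustration under one stated assumption: every loss channel scales linearly with the circulating signal power, so each can be expressed as an equivalent output-coupler transmission. The variable names are hypothetical, and the input numbers are the measured values quoted in the text.

```python
# Power bookkeeping for the resonated signal branch (values from the text).
# Assumption: losses scale linearly with the circulating power, so each loss
# channel can be written as an equivalent output-coupler transmission.

coupler_T = 0.01            # 1% output coupler
signal_out_mw = 340.0       # signal power through the output coupler
crystal_loss_mw = 2 * 60.0  # reflections from the KTP crystal, both directions

intracavity_mw = signal_out_mw / coupler_T               # circulating signal power
generated_mw = signal_out_mw + crystal_loss_mw           # total signal power generated
crystal_loss_fraction = crystal_loss_mw / intracavity_mw # round-trip crystal loss
effective_T = generated_mw / intracavity_mw              # effective output coupling

pump_mw = 2500.0
depleted_mw = 0.55 * pump_mw  # pump power removed at 55% depletion

print(f"generated signal power: {generated_mw:.0f} mW")
print(f"round-trip crystal loss: {crystal_loss_fraction:.2%}")
print(f"effective output coupling: {effective_T:.2%}")
print(f"depleted pump power: {depleted_mw:.0f} mW")
```

With these inputs the generated signal power is 460 mW and the effective output coupling comes out near 1.35%, that is, the 1% coupler plus roughly 0.35% per round trip from the crystal reflections. The depleted pump power of about 1.4 W is the budget from which the signal, the idler, and the remaining losses must be drawn.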

Interestingly,  the  OPO  also  produces  output  at 
two other non-phase-matched7 wavelengths that correspond
to collinear second-harmonic generation of
the signal branch (e + e → e) and noncollinear sum-
frequency generation between the pump and the signal
(o + e → o). For a pump wavelength of 780 nm
and a signal wavelength of 1300 nm the second-
harmonic wavelength is 650 nm and the sum-
frequency wavelength is 485 nm. A total of almost
100 mW of second-harmonic power is generated
(50 mW in each direction), but only 10 mW gets
transmitted through the infrared optics and output
coupler. The collinear second harmonic could be
utilized for experimental purposes and is also
useful for aligning the signal through extracavity
optics, after which it can be easily filtered out.
Sum-frequency light of 100 µW was measured after
the output coupler. In all, the OPO system produces
synchronized  femtosecond  radiation  at  five  different 
wavelengths. 

Without prisms in the OPO cavity the autocorrelation
and spectra show signs of significant chirp.
The pulse width as measured from the intensity
autocorrelation is approximately 500 fs owing to
the long decay time of the wings. With prisms in
the OPO cavity two regimes are encountered. For
net negative group-velocity dispersion (GVD) the
pulses are unchirped with a minimum pulse width
of 135 fs (fit to a sech² shape) and have a smooth
spectrum (ΔνΔt = 0.45) [see Figs. 2(a) and 2(b)].
For net positive GVD the pulses are slightly chirped
with a broader pulse width and a split spectrum [see
Figs. 2(c) and 2(d)]. Near zero GVD the OPO may
abruptly flip into either the chirped or unchirped
mode. This behavior is in contrast to the observed
smooth transition between operation with net negative
and positive GVD of the OPO reported in Ref. 2.
Therefore a nonlinear chirp must be generated in
the KTP, which accounts for the runaway condition
in the positive-GVD regime. This would also explain
why the time-bandwidth product is 45%
greater than the transform limit for the minimum
pulse width. This effect is most likely due to self-
phase modulation of the signal in the crystal as a
result of the high intracavity intensity and large
nonlinear index of KTP. Self-phase modulation in
KTP was identified as a source of broadening of the
pump laser in Ref. 1 and is consistent with the
shape of the signal spectrum in Fig. 2(c). It is expected
that the pulse widths are approximately constant
over the tuning range owing to the relatively
constant inverse group-velocity mismatch between
the pump and the signal. The larger mismatch for
the idler suggests pulse widths approximately 50%
greater than the signal.
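
The comparison with the transform limit can be made explicit. The sketch below assumes a sech² intensity profile, for which the transform-limited time-bandwidth product is approximately 0.315; that constant and the function name are not taken from the text.

```python
# Compare a measured time-bandwidth product with the transform limit.
# Assumption: sech^2 intensity profile, whose transform-limited
# delta_nu * delta_t is approximately 0.315.

SECH2_LIMIT = 0.315

def excess_over_limit(tbp: float, limit: float = SECH2_LIMIT) -> float:
    """Fractional amount by which tbp exceeds the transform limit."""
    return tbp / limit - 1.0

excess = excess_over_limit(0.45)  # measured product for the 135-fs pulses
print(f"{excess:.0%} above the transform limit")
```

With the measured product of 0.45 this gives an excess of a little over 40%, in line with the figure of about 45% quoted in the text.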

It was also observed in the unchirped regime that
a slight detuning of the length shortened the pulse
widths to approximately 75 fs (and reduced the output
power by 25%). The pulse width was also decreased
to 75 fs by increasing the output coupling at
constant zero detuning. This was achieved by inserting
a thin glass flat in the OPO cavity and rotating
it away from Brewster's angle, effectively
reducing the intracavity power by increasing the
output coupling to 1.5% (plus 0.4% from the crystal).
Therefore this pulse shortening results from a decrease
in intracavity power as the OPO is operated
closer to threshold, as predicted by theory.* The reduction
in intracavity power reduces the magnitude
of self-phase modulation (both linear and nonlinear
chirp) so that less dispersion compensation is required
from the prism sequence.

Fig. 2. (a) Spectrum and (b) autocorrelation of the signal
pulse for net negative GVD. The time-bandwidth product
is 0.45. (c) Spectrum and (d) autocorrelation of the
chirped signal pulse for net positive GVD. The abrupt
transition between these two regimes suggests a self-
phase-modulation process in the crystal.

Fig. 3. OPO signal (bottom) and idler (top) spectra obtained
by angle tuning the OPO over a range of one set of
mirrors. Broad tuning may also be achieved by changing
the pump wavelength without rotating the KTP crystal or
altering the OPO alignment.

The  insertion  of  the  Brewster-cut  prism  sequence 
reduces  the  output  power  of  the  signal  to  300  mW  in 
the  chirped  regime,  but  we  believe  that  with  a  more 
careful  alignment  full  recovery  of  the  340  mW  is 
possible.  This  loss  is  primarily  due  to  a  small  rota¬ 
tion  of  the  signal  polarization  in  the  KTP  crystal, 
which is oriented slightly away from φ = 0°. The
output power for the unchirped pulses is reduced to
approximately 180 mW. This loss of power is not
due  to  simple  alignment  since  the  prism  is  only 
translated. 

Tuning of the OPO is straightforward and may be
accomplished by three different means. First, adjustment
of the length mismatch of the OPO cavity results
in a wavelength shift as reported previously1
and may be used to stabilize the OPO length at a
fixed wavelength. The wavelength range over
which the OPO will oscillate while the length is adjusted
is a measure of how sensitive the OPO is to
length variations. The OPO can withstand a 5-µm
length variation, which results in a wavelength shift
of almost 50 nm. Second, a change in the pump
wavelength will tune the OPO without changing the
crystal orientation or OPO alignment; only the
length of the OPO cavity must be adjusted to match
the new pump cavity length. We can tune our
Ti:sapphire laser from 765 to 815 nm while maintaining
mode locking and cavity alignment. This
results in tuning of the signal branch from 1.22 to
1.34 µm and from 2.05 to 2.08 µm in the idler
branch. Note that the wavelength of the idler remains
relatively fixed, whereas the signal tunes over
120 nm as the pump wavelength is varied over
50 nm. Typically this type of tuning will also result
in a change in pump power. Third, the OPO may be
tuned in the traditional manner by adjusting the
phase-matching angle of the KTP crystal. We can
tune over a 100-nm range by freely rotating the
KTP crystal and adjusting the cavity length. Beyond
this range the OPO alignment needs to be
modified. The operation of the OPO is quite robust
so that broad tuning is accomplished by iterating between
rotating the crystal and adjusting the OPO
alignment while maintaining oscillation. Representative
spectra are displayed in Fig. 3 for both the
signal and the idler. The demonstrated tuning is
limited by the optics available in our laboratory, but
with appropriate optics the full tuning range will be
accessible.
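
The pump-tuning numbers above follow from photon-energy conservation, 1/λ_pump = 1/λ_signal + 1/λ_idler. The sketch below is a minimal check that assumes the endpoints pair up as (765 nm pump, 1.22 µm signal) and (815 nm pump, 1.34 µm signal), which the text does not state explicitly. The function name is illustrative.

```python
# Idler wavelength from photon-energy conservation:
#   1/lambda_pump = 1/lambda_signal + 1/lambda_idler

def idler_nm(pump_nm: float, signal_nm: float) -> float:
    """Idler wavelength (nm) for a given pump and signal wavelength (nm)."""
    if signal_nm <= pump_nm:
        raise ValueError("signal wavelength must exceed pump wavelength")
    return 1.0 / (1.0 / pump_nm - 1.0 / signal_nm)

# Assumed pairing of the tuning-range endpoints quoted in the text.
for pump, signal in [(765.0, 1220.0), (815.0, 1340.0)]:
    idler = idler_nm(pump, signal)
    print(f"pump {pump:.0f} nm, signal {signal:.0f} nm -> idler {idler:.0f} nm")
```

Under this pairing the idler moves only from about 2.05 to 2.08 µm while the signal covers 120 nm, which is the behavior described for pump tuning.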

No  alignment  of  the  OPO  is  necessary  on  a  day-to- 
day  basis;  length  adjustment  is  all  that  is  required  to 
regain  oscillation.  Furthermore  the  OPO  is  not  ex¬ 
tremely  sensitive  to  pump  steering.  Alignment  of 
the  pump  through  two  pinholes  suffices  to  recover 
oscillation if the Ti:sapphire alignment is considerably
altered. The output of the OPO is an excellent
TEM00 mode that is made possible by the tight
Z  focus  shown  in  Fig.  1.  Thus  the  OPO  is  a  practi¬ 
cal  laser  source  for  experimental  ultrafast  research. 

A  feedback  circuit  to  maintain  length  matching 
would  be  useful  to  maximize  stability,  although  all 
the  data  presented  in  this  Letter  were  obtained 
without  any  length  stabilization. 

In  summary,  we  have  reported  the  development  of 
a  high-power,  high-repetition-rate  femtosecond 
OPO externally pumped by a self-mode-locked
Ti:sapphire laser. More than 1.0 W of the pump
laser  power  is  converted  to  tunable  OPO  radiation 
for  a  conversion  efficiency  of  55%.  Unchirped 
pulses  of  135  fs  can  be  generated  across  the  demon¬ 
strated  tuning  range  of  the  device.  Pulse  shorten¬ 
ing  to  75  fs  is  achieved  by  increasing  the  output 
coupling  at  the  expense  of  output  power. 

This research was supported by the Joint Services
Electronics  Program  and  the  National  Science 
Foundation.  We  are  grateful  to  L.  K.  Cheng  and 
J.  D.  Bierlein  of  E.  I.  DuPont  de  Nemours  &  Com¬ 
pany  for  providing  the  KTP  material. 

Note  added  in  proof:  We  recently  generated 
nearly  transform-limited  57-fs  signal  pulses  at  an 
output power of 115 mW.

Using an acousto-optically mode-locked chromium-doped forsterite laser, operated at 77 K and coupled to a
nonlinear resonator containing a single-mode fiber, we have produced femtosecond pulses of 150-fs duration at
1.23 µm with useful output powers of approximately 60 mW. This represents what is to our knowledge the first
demonstration of femtosecond pulse generation from this laser system using the coupled-cavity mode-locking
scheme.


The chromium-doped forsterite laser (Cr:Mg2SiO4)
is based on the Cr4+ ion in a tetrahedrally coordinated
lattice site serving as the laser-active center.
First demonstrated by Petricevic et al.,1 the laser
emission, centered at 1.23 µm, was shown to be tunable
over as broad a range as from 1.13 to 1.37 µm.2
This feature, in conjunction with ample output
powers, makes the Cr:forsterite laser a useful source
for the optical characterization of fiber-optic systems
at 1.3 µm and spectroscopic studies of narrow-bandgap
semiconductors. To date, room-temperature
Q-switched,1,2 cw,3 flash-lamp-pumped,2,4 cw acousto-
optically  mode-locked,5  synchronously  pumped,5  and 
cw  cryogenic6  operations  have  been  demonstrated 
with  various  optical  pumping  mechanisms.  In  par¬ 
ticular,  our  previous  experiments  revealed  an  ap¬ 
proximately  threefold  increase  in  cw  output  power 
when  the  gain  medium  was  cooled  to  77  K  (Ref.  6) 
(for  pump  powers  well  above  threshold),  resulting  in 
cw output powers as high as 2.8 W at 1.23 μm when
the system was pumped by a cw Nd:YAG laser.7 The
Cr:forsterite laser used in the mode-locking experiments
described in this Letter was also operated
cryogenically  to  achieve  increased  power  outputs. 
Furthermore, the broad emission bandwidth of this
laser can also be utilized for generating ultrashort
light  pulses  on  a  femtosecond  scale.  Such  pulses 
are  ideal  for  applications  in  short-pulse  propaga¬ 
tion  experiments  and  femtosecond  time-resolved 
spectroscopy. 

In  this  Letter  we  report  what  is  to  our  knowledge 
the first demonstration of additive-pulse mode-locked
operation of an actively mode-locked
Cr:forsterite laser. Using this technique, we have
produced pulses of 150-fs duration (FWHM) at
1.23 μm with useful output powers of approximately
60 mW.

Additive-pulse  mode  locking  (APM),  a  well- 
established  scheme  for  generating  ultrashort  light 
pulses,  has  been  successfully  applied  to  many  solid- 
state  laser  systems  (see  Ref.  8  and  references  therein 
for a thorough discussion). Briefly, in its most commonly
practiced form, this technique, also known as
coupled-cavity9,10 or interferential11 mode locking,
involves coupling the master laser resonator to an
external  nonlinear  cavity  containing  an  optical  fiber. 
The  auxiliary  fiber  cavity,  in  which  propagating 
light  pulses  acquire  a  Kerr-effect-induced  phase 
shift,  can  be  regarded  as  a  nonlinear  termination 
equivalent  to  a  mirror  with  an  intensity-dependent 
reflectivity.  Once  the  nonlinear  phase  shift  is  ad¬ 
justed  to  give  constructive  interference  at  the  center 
and  destructive  interference  in  the  wings  of  the 
master  cavity  and  coupled-cavity  pulses  when  they 
combine  at  the  output  coupler  of  this  composite  opti¬ 
cal  resonator,  a  dramatic  reduction  in  the  output 
pulse  width  results  provided  that  the  two  cavities 
are interferometrically matched in length. To
date, APM has been demonstrated in KCl:Tl,12,11
LiF:F2+,12 NaCl:OH-,14 Ti:sapphire,15,16 Nd:YAG,17
Nd:YLF,18 and Nd:glass19 lasers. As described in
what follows, we have applied this scheme to generate
femtosecond pulses from the Cr:forsterite laser.

Figure  1  shows  the  experimental  setup  of  the 
coupled-cavity Cr:forsterite laser used for the APM
experiments. The master cavity, consisting of a
flat high reflector (HR1), a flat 11% transmitting
output coupler (O.C.), an acousto-optic prism
mode locker (M.L.), and a pair of 5-cm focal-length
antireflection-coated plano-convex lenses (L1 and
L2) around the gain medium, was end pumped by a
cw Nd:YAG laser (Quantronix Model 416). The gain
medium was a 20 mm × 5 mm × 5 mm piece of
forsterite crystal cut along the a, b, and c axes (using
the Pnma crystallographic notation), with the longest
dimension along the c axis. The estimated laser-active
center concentration was 4 × 10^18 cm^-3. To
prevent  deleterious  etalon  effects,  the  normal-cut 
crystal  was  polished  with  a  slight  wedge  between  the 
5-mm-sided square faces, which were also broadband
antireflection coated at 1.28 μm. The crystal
was maintained in an evacuated Dewar (pressure
~10^-6 Torr) at 77 K with the plano-convex lenses L1
and L2 serving as the Dewar windows. Using 5 W
of input pump power and a 70-cm focal-length
mode-matching lens between the pump and the
Cr:forsterite laser, we obtained 625 mW of cw
TEM00 output power with the output field polarized

Fig. 1. Experimental setup of the APM Cr:forsterite
laser. The Cr:forsterite crystal was maintained at 77 K
inside an evacuated Dewar with lenses L1 and L2 serving
as the Dewar windows.


Fig.  2.  Oscilloscope  trace  of  the  actively  mode-locked 
pulses  from  the  master  laser  resonator  using  the  11% 
transmitting  output  coupler. 


along  the  a  axis.  The  Cr:forsterite  crystal  had  78% 
absorption at the pump wavelength of 1.06 μm at
77  K.  An  asymmetric  cavity  configuration  used  to 
prevent  possible  double  pulsing  effects  of  the  mode- 
locked  laser  together  with  a  choice  of  comparably 
shorter  focal-length  lenses  around  the  gain  medium 
resulted  in  less  than  optimum  mode  matching  be¬ 
tween  the  pump  and  the  laser  cavities  and  hence 
lower  output  power  than  what  was  reported  in  Ref.  6. 

Before employing the nonlinear coupled-cavity
scheme  to  produce  femtosecond  pulses,  we  actively 
mode  locked  the  master  laser  resonator,  using  a 
quartz  acousto-optic  modulator  in  the  form  of  a 
Brewster-cut  prism  (Crystal  Technology),  placed 
within  3  cm  of  the  output  coupler.  With  approxi¬ 
mately  2  W  of  absorbed  rf  power  at  40.999  MHz,  the 
laser was acousto-optically mode locked and generated
output pulses at an 82-MHz repetition rate.
The  individual  acousto-optically  mode-locked  pulses 
were  monitored  by  a  high-speed  InGaAs  detector 
with  a  response  time  of  approximately  80  ps  con¬ 
nected  to  a  sampling  oscilloscope  with  a  response 
time  of  less  than  30  ps.  With  a  1%  transmitting  out¬ 
put  coupler,  detector-limited  pulse  widths  of  80  ps 
(FWHM)  were  measured,  indicating  that  the  actual 
pulses  were  shorter  and  comparable  with  what  was 
reported by Seas et al.5 However, with the 11%
transmitting output coupler, used in the APM experiments,
the minimum pulse width obtained was
320 ps (FWHM), as shown in Fig. 2.
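
As a rough cross-check of the drive frequency and repetition rate quoted above, the following sketch assumes a standing-wave acousto-optic modulator, which modulates the cavity loss at twice its rf drive frequency, and a simple linear cavity whose round-trip time sets the repetition rate. Both assumptions and the function names are illustrative; they are not taken from the Letter.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def repetition_rate_hz(rf_drive_hz):
    """Repetition rate for a standing-wave AO mode locker (assumed: twice the rf drive)."""
    return 2.0 * rf_drive_hz


def linear_cavity_length_m(rep_rate_hz):
    """Optical length L of a linear cavity whose round-trip time 2L/c equals 1/rep_rate."""
    return C / (2.0 * rep_rate_hz)


f_rep = repetition_rate_hz(40.999e6)  # rf drive frequency quoted in the text
print(f"repetition rate: {f_rep / 1e6:.3f} MHz")
print(f"implied optical cavity length: {linear_cavity_length_m(f_rep):.3f} m")
```

Under these assumptions the 40.999-MHz drive corresponds to the quoted 82-MHz pulse train, and the implied optical cavity length is a little over 1.8 m.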

The  nonlinear  coupled  cavity  was  established  by 
using  an  85%  reflecting  beam  splitter  (B.S.).  The 
single-mode  fiber  (Corning  1521)  of  length  50.8  cm 
placed in this external cavity had zero group-velocity
dispersion at 1.3 μm and a mode-field diameter of
9 μm. The fiber ends were cleaved with tilt angles
of less than 0.5 deg to the surface normal. Using
coupling spheres (S1 and S2) with antireflection
coating  on  the  input  side  and  index-matching  gel  be¬ 
tween  the  output  side  and  the  fiber  surface,  together 
with  an  antireflection-coated  mode-matching  lens 
(L3),  we  obtained  coupling  efficiencies  of  approxi¬ 
mately  70%.  A  flat  high  reflector  mirror  (HR2) 
placed  a  distance  from  the  output  end  of  the  fiber 
provided the nonlinear feedback with retroreflection
efficiencies approaching 90%. The output of
the APM Cr:forsterite laser was monitored by using
three separate diagnostics: a Michelson interferometer
with a LiIO3 nonlinear crystal to measure
the collinear and background-free intensity autocorrelations
of the pulses, a Ge photodiode with a 2-ns
rise time to investigate the pulse train over the 50-ns
to 200-μs time scale, and a scanning spectrometer
(Monolight  Model  6000)  to  measure  the  bandwidth 
of  the  output  pulses. 

When  the  two  cavity  lengths  were  interferometri- 
cally  matched,  enhanced  mode-locked  operation  of 
the  Cr:forsterite  laser  was  observed.  Figures  3 
and 4 show the background-free intensity autocorrelation
and the spectrum of the APM pulses, respectively.
With the assumption of a sech^2 intensity
profile, the width (FWHM) of the pulses was measured
to be 150 fs. A simultaneous measurement of
18-nm  bandwidth  gave  a  time-bandwidth  product  of 
approximately  0.55,  roughly  1.7  times  larger  than 
the  theoretical  limit  of  0.32  for  the  assumed  pulse 
shape. The output of the laser was at 1.23 μm.
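
The time-bandwidth product quoted above can be reproduced approximately from the stated pulse width, spectral width, and center wavelength. The sketch below converts the wavelength bandwidth to a frequency bandwidth with Δν = cΔλ/λ² and compares the result with the usual transform limit for a sech^2 pulse, taken here as 0.315. The function and constant names are illustrative.

```python
C = 299_792_458.0            # speed of light in vacuum, m/s
SECH2_TRANSFORM_LIMIT = 0.315  # standard time-bandwidth limit for a sech^2 intensity profile


def time_bandwidth_product(pulse_fwhm_s, bandwidth_m, center_wavelength_m):
    """Return pulse FWHM times frequency bandwidth, with delta_nu = c * delta_lambda / lambda^2."""
    delta_nu_hz = C * bandwidth_m / center_wavelength_m ** 2
    return pulse_fwhm_s * delta_nu_hz


tbp = time_bandwidth_product(150e-15, 18e-9, 1.23e-6)
print(f"time-bandwidth product: {tbp:.2f}")
print(f"ratio to sech^2 limit:  {tbp / SECH2_TRANSFORM_LIMIT:.2f}")
```

With the rounded inputs given in the text this yields a product slightly below the quoted 0.55 and a ratio close to 1.7, consistent with the Letter within the precision of the inputs.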

It  was  found  that  with  the  above  mirror  reflectivi¬ 
ties  and  coupling  retroreflection  efficiencies,  a 
threshold  power  level  of  40  mW  coupled  through  the 


Fig. 3. Background-free intensity autocorrelation of the
APM Cr:forsterite pulses. The measured FWHM is
150 fs.

Fig. 4. Spectrum of the APM Cr:forsterite pulses. The
resulting time-bandwidth product is 0.55.

fiber  was  required  for  observation  of  the  onset  of 
APM  operation.  Close  to  and  somewhat  above  this 
threshold,  the  external  cavity  length  had  to  be  per¬ 
turbed  about  its  correct  value  for  APM  action  to  be 
observed.  When  the  fiber  power  was  increased  to 
well  above  40  mW,  the  pulses  became  more  stable 
and  could  be  sustained  without  the  use  of  active 
cavity-length  stabilization.  By  simultaneously 
monitoring  the  output  using  the  Ge  photodiode,  we 
found  that  near  the  40-mW  threshold  the  APM  out¬ 
put  pulse  train  appeared  as  a  series  of  repetitively 
Q-switched  pulses,  each  of  700-ns  duration  occurring 
at  143-kHz  repetition  rate.  For  power  levels  three 
times  above  the  threshold,  repetitive  Q  switching 
gave  way  to  a  quiet,  stable  pulse  train  with  occa¬ 
sional  weak  relaxation  oscillations.  Similar  turn-on 
behavior  has  been  observed  by  Spielmann  et  al.20 
regarding  the  self-starting  APM  Nd: glass  laser. 
Coupled fiber power levels of as much as 280 mW
were tried, higher power levels being avoided to prevent
possible damage to fiber ends. This resulted
in 63 mW of useful output power at 1.23 μm. Using
the relevant laser and fiber parameters, we found
the calculated peak nonlinear phase shift21 between
the center and the wings of the external cavity
pulses to change from 2.7π at the 40-mW threshold
to 19π when the fiber power was 280 mW. Within
the 10% error associated with the measurements,
the pulse width of the APM Cr:forsterite laser remained
essentially insensitive to the variation in
this nonlinear phase shift.
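
The two phase-shift values quoted above are consistent with a Kerr-induced phase shift that scales linearly with the power coupled through the fiber. The sketch below checks this scaling, assuming the pulse width and fiber parameters stay fixed, so that peak power tracks average power. The function name is illustrative.

```python
def scaled_phase_shift_in_pi(phase_at_ref_in_pi, ref_power_mw, power_mw):
    """Scale a Kerr phase shift (in units of pi) linearly with coupled fiber power."""
    return phase_at_ref_in_pi * power_mw / ref_power_mw


# Threshold values quoted in the text: 2.7 pi at 40 mW of coupled fiber power.
phase_at_280_mw = scaled_phase_shift_in_pi(2.7, 40.0, 280.0)
print(f"phase shift at 280 mW: {phase_at_280_mw:.1f} pi")
```

Linear scaling from the 40-mW threshold gives a value that rounds to the 19π reported at 280 mW.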

It  was  observed  that  incomplete  mode  locking,  in 
the  form  of  pulses  with  excess  amplitude  noise  or 
spiky  structure,  would  give  rise  to  unsatisfactory 
APM  action,  resulting  in  broader  pulses.  It  was 
therefore  essential  to  get  clean,  spike-free  mode- 
locked  operation  of  the  master  laser  resonator  as  de¬ 
picted  in  Fig.  2  to  obtain  femtosecond  pulses. 

In  conclusion,  we  have  demonstrated,  for  what  is 
to our knowledge the first time, additive-pulse mode-locked
operation of the Cr:forsterite laser that produces
150-fs pulses at 1.23 μm with useful output
powers of approximately 60 mW. It was also observed
that near the threshold of the APM action,
the  mode-locked  pulse  train  came  as  a  series  of 
repetitively  Q-switched  pulses  of  700-ns  duration 
occurring  at  143-kHz  repetition  rate.  For  power 
levels  sufficiently  above  the  threshold,  however,  a 
stable,  quiet  pulse  train  of  femtosecond  pulses  was 
produced. 

ABSTRACT

Carrier energy relaxation times have been measured in In0.53Ga0.47As grown by MBE on InP. Layer thicknesses from 0.5 to
3 microns have been studied. An NaCl color center laser using additive pulse modelocking supplied 150 femtosecond pulses
with photon energies between 780 and 806 meV. These were used for time resolved optical saturation measurements near the
750 meV material bandgap. Carrier densities between 0.4 x 10^18 and 5.7 x 10^18 cm^-3 were achieved. Lifetimes of about 150
femtoseconds  are  reported.  These  are  observed  to  decrease  with  increasing  carrier  density  and  with  decreasing  photon  energy. 

1.  INTRODUCTION 

The bandgap of the InGaAs/InP system at 1.55 microns has made it a useful material for optoelectronic device fabrication. In
addition, its high mobility suggests the possibility of fabricating extremely fast devices. This has been done, for example, in a
heterojunction bipolar transistor*. Fast pin photodiodes are also being developed in InGaAs. It is therefore useful to
characterize the carrier lifetimes. Previously, photoluminescence upconversion has been applied to measure longer time scale
relaxation rates^A, as well as femtosecond pump-continuum probe methods^. Both used photon energies well above the
bandgap. In this work, near-bandgap measurements were made well below the intervalley scattering threshold, using the
equal  pulse  correlation  technique  to  extract  the  lifetime  from  the  transient  optical  saturation  of  different  samples.  Exploiting 
the  tuneability  of  the  NaCl  color  center  laser,  these  experiments  were  performed  at  several  photon  energies. 

InGaAs is a direct gap material, and the bandgap of InGaAs/InP is well known to be 750 meV at 300 K. The next lowest
transition occurs at more than 3 times the photon energy for which the split-off band separation is about 343 meV.
Therefore only the direct Γ transition to the conduction band from the light- and heavy-hole valence bands is significant in
these  experiments;  split-off  holes  cannot  participate  in  the  low-energy  transitions  excited  by  our  laser.  For  this  system,  the 
longitudinal  optical  phonon  energy  is  34  meV. 
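
To relate the energies quoted so far, the sketch below converts the 780 and 806 meV photon energies from the abstract to wavelengths and expresses the excess energy above the 750 meV bandgap in units of the 34 meV LO phonon energy. The conversion constant hc ≈ 1239.84 meV·μm is standard; the function names are illustrative, and the excess energy computed here is the total shared between electron and hole, not the electron energy alone.

```python
HC_MEV_UM = 1239.841984  # h*c in meV * micrometers

BANDGAP_MEV = 750.0      # InGaAs/InP bandgap at 300 K, from the text
LO_PHONON_MEV = 34.0     # longitudinal optical phonon energy, from the text


def wavelength_um(photon_energy_mev):
    """Vacuum wavelength in micrometers for a given photon energy in meV."""
    return HC_MEV_UM / photon_energy_mev


def excess_energy_in_lo_phonons(photon_energy_mev):
    """Photon energy above the bandgap, in units of the LO phonon energy."""
    return (photon_energy_mev - BANDGAP_MEV) / LO_PHONON_MEV


for energy_mev in (780.0, 806.0):
    print(f"{energy_mev:.0f} meV -> {wavelength_um(energy_mev):.3f} um, "
          f"excess = {excess_energy_in_lo_phonons(energy_mev):.2f} LO phonon energies")
```

The excitation therefore deposits between roughly one and two LO phonon energies above the band edge across the tuning range.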

Experimental results from 3 samples are reported here. 3, 1, and 0.5 μm films of In0.53Ga0.47As grown by MBE on an Fe-doped
InP substrate were studied. Note that the InP substrate's bandgap of 1.4 eV makes it quite transparent to the
wavelengths  of  the  NaCl  laser.  The  substrate  was,  however,  lightly  polished  to  minimize  scattering  from  the  substrate.  The 
transparency  of  the  InP  was  verified  with  a  Cary  5  spectrophotometer.  For  wavelengths  longer  than  1  micron,  it  revealed  a 
smooth,  resonance-free  transmission  spectrum  for  an  InP  sample  taken  from  the  same  wafer  as  that  used  to  grow  our 
samples.  All  experiments  were  carried  out  at  300K. 

2.  EQUAL  PULSE  CORRELATION  SPECTROSCOPY 

Equal  pulse  correlation  spectroscopy  uses  two  identical  excitation  pulses  derived  from  the  same  source  but  delayed  relative  to 
each other. The time-averaged absorption in the sample is then a symmetrical function of delay. More precisely, the
experiment measures the convolution of the material response with the second order autocorrelation of the laser pulse**. This
assumes that the sample is optically thin, that is,

$$ L/a \ll 1 \qquad (1) $$

where L is the sample thickness, and a is the absorption depth.

For a linear response function R(t), and a second-order pulse autocorrelation function f(t), the equal pulse correlation signal
takes the form

$$ S(\tau) = \int_0^\infty ds\, R(s)\,[f(\tau - s) + f(\tau + s)] + \int_0^\infty ds\, R(s)\,[c(s, \tau) + c(s, -\tau)] \qquad (2) $$

where c(s, τ) models the coherent response to the rapidly varying electric fields. For transform-limited pulses, the coherent
response term is negligible for delays longer than one and a half pulse widths. Most of the useful information about the
sample is contained within the first term. Note that the response is symmetrical in delay τ. The nature of the equal pulse
correlation signal is illustrated in figures 1 and 2 for 100 femtosecond pulses convolved with 50 and 200 femtosecond decay
functions.

Fig. 1. The convolution of Eq. (2), with the autocorrelation function superimposed. Pulsewidth is 100 fs, and decay time is 200 fs.

Fig. 2. Same as Fig. 1, but with 50 fs lifetime.

Typical  pump  probe  spectroscopy  also  gives  a  convolution  of  autocorrelation  and  response  function,  but  information  near 
zero  delay  is  distorted  by  the  order  reversal  of  saturating  and  probing  pulses.  Equal  pulse  correlation  allows  simpler  fitting 
when  the  decay  times  observed  are  close  to  the  excitation  pulse  width. 
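
The first (incoherent) term of the equal pulse correlation signal can be evaluated numerically to see how the decay time changes the trace. The sketch below is a minimal illustration, not the authors' fitting code: it assumes a single-exponential response R(s) = exp(-s/T), uses the standard analytic intensity autocorrelation of a sech^2 pulse for f, and integrates with the trapezoidal rule. All function names, the integration step, and the truncation of the integral are assumptions made for illustration.

```python
import math


def sech2_autocorrelation(tau_fs, pulse_fwhm_fs):
    """Normalized intensity autocorrelation of a sech^2 pulse (value 1 at zero delay)."""
    t0 = pulse_fwhm_fs / 1.7627          # sech^2 width parameter from the intensity FWHM
    x = abs(tau_fs) / t0
    if x < 1e-3:
        return 1.0                       # limiting value near zero delay
    if x > 50.0:
        return 0.0                       # negligible tail; also avoids overflow
    return 3.0 * (x * math.cosh(x) - math.sinh(x)) / math.sinh(x) ** 3


def equal_pulse_signal(delay_fs, decay_fs, pulse_fwhm_fs, step_fs=1.0):
    """Incoherent term: integral over s >= 0 of R(s) [f(tau - s) + f(tau + s)], R(s) = exp(-s / decay)."""
    n = int(12.0 * decay_fs / step_fs)   # truncate where R(s) has fallen to about e^-12
    total = 0.0
    for i in range(n + 1):
        s = i * step_fs
        weight = 0.5 if i in (0, n) else 1.0   # trapezoidal end-point weights
        total += weight * math.exp(-s / decay_fs) * (
            sech2_autocorrelation(delay_fs - s, pulse_fwhm_fs)
            + sech2_autocorrelation(delay_fs + s, pulse_fwhm_fs)
        )
    return total * step_fs


# 100 fs pulses with 50 fs and 200 fs decay times, as in the illustrated cases.
for decay_fs in (50.0, 200.0):
    peak = equal_pulse_signal(0.0, decay_fs, 100.0)
    wing = equal_pulse_signal(300.0, decay_fs, 100.0)
    print(f"decay {decay_fs:.0f} fs: S(300 fs) / S(0) = {wing / peak:.3f}")
```

The signal is symmetric in delay by construction, and the longer decay time leaves a visibly larger signal in the wings relative to the peak, which is the feature the fit relies on.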

3.  Experiment 

The  experiment  is  laid  out  as  a  Michelson  interferometer,  with  one  arm  mounted  on  a  galvanometer-driven  taut  band 
translator^  which  gives  a  smooth  sinusoidal  variation  of  the  optical  delay.  Both  pulses  had  parallel  polarizations.  Only  the 
linear  region  of  the  delay,  near  zero  crossing,  is  used  to  collect  data.  The  spatially  overlapped  pulse  trains  from  each 
Michelson  arm  are  then  attenuated  by  a  rotatable  antireflection  coated  linear  polarizer  followed  by  a  fixed  polarizing 
beamsplitter used to define a constant polarization state for our experiments. The light is then split into a signal and a reference
beam. The reference beam is focussed onto a germanium photodiode for use in noise suppression. The signal beam is focussed to an 8
micron spot on the sample. The light transmitted through is collected by a lens and focussed onto a second, identical
germanium  photodiode.  Both  detectors  are  preceded  by  neutral  density  filters  to  balance  the  photocurrents  and  also  to 
minimize  detector  nonlinearity. 

The large amount of amplitude modulation (5-10%) present on the output of the additive pulse modelocked laser requires that
some form of noise cancellation be used. A well-known passive approach has worked best to date: subtracting from the
nonlinear response signal a signal proportional to the instantaneous laser intensity. This is easily accomplished using the
photodiodes  sampling  optical  intensity  before  and  after  the  sample,  as  described  above.  The  two  photodiodes  are  directly 
connected  so  as  to  subtract  their  photocurrents.  Using  a  variable  neutral  density  filter  to  balance  the  average  photocurrents 
results  in  excellent  subtraction  of  laser  amplitude  fluctuations.  Little  decrease  in  cancellation  efficiency  occurs  near  zero 
delay,  since  the  nonlinearity  is  small,  only  about  2%. 


The difference photocurrent is used as the input to a transimpedance amplifier (Ithaco model 1211) which provides a voltage
proportional to the difference current. It also serves to filter out fast interferometric oscillations in the data. This voltage is
averaged  synchronously  with  the  delay  variation,  using  a  12  bit  a/d  converter.  Five  hundred  to  fifteen  hundred  averages  were 
used  to  obtain  the  traces  presented  here. 

The  source  used  in  these  experiments  was  a  sodium  chloride  color  center  laser  using  additive  pulse  modelocking.  The 
characteristics  of  this  laser  have  been  reported  elsewhere^.  This  laser  was  used  to  produce  100  -  200  femtosecond  pulses 
from 1.54 microns to 1.59 microns in this experiment. Up to 100 milliwatts of output power is available, at a repetition rate of
164 MHz. At the wavelengths reported in this paper, the laser pulses are approximately transform limited.


4.  Fitting 


Extracting  the  carrier  lifetime  proved  challenging,  since  the  carriers  clearly  relaxed  on  a  time  scale  comparable  to  the 
pulse  width.  Simply  fitting  the  data  tails  (data  well  separated  from  zero  delay)  to  a  sum  of  exponentials  is  problematic  in  this 
case. It was decided instead to fit to the convolved model described earlier. The dominant decay is clearly on the order of 150
femtoseconds  or  less,  so  only  one  exponential  was  used  in  the  fit.  This  is  not  meant  to  imply  that  longer  decays  are  not 
present  -  work  is  still  progressing  on  refining  the  analysis.  The  sodium  chloride  additively  pulse  modelocked  laser  has 
previously  been  demonstrated  to  produce  transform  limited  pulses  near  1.55  micron  wavelengths.  The  second  harmonic 
autocorrelation  trace  of  these  pulses  fit  a  hyperbolic  secant  function  rather  well.  Therefore  the  measured  second  harmonic 
autocorrelation trace width was used as a fixed fit parameter. All fitting was done starting 300 femtoseconds after the zero
delay point. This ensured that the coherent artifact did not distort the results. Good fits to the data were obtained for a ratio of
fit function peak to data peak of 1:2. This ratio, which corresponds to the amount of coherent artifact present, was therefore
fixed in our analysis at this value. Data were taken out to a delay of 1.7 picoseconds on either side of zero delay.

5.  Results 

All  of  our  experiments  show  fast  decay  times.  A  representative  equal  pulse  correlation  trace  is  shown  in  figures  3  and  4. 


Fig. 3. A representative equal pulse autocorrelation trace taken with InGaAs, taken at a wavelength of 1.596 μm.

Fig. 4. An expanded view of the part of the trace used for fits. The modulation is residual laser noise.


Lifetimes  in  the  0.5  micron  and  3  micron  samples  range  between  100  and  200  femtoseconds.  There  is  a  very  clear  decrease  in 
lifetime  with  increasing  carrier  density.  This  is  attributable  to  carrier-carrier  scattering.  There  is  also  a  decrease  in  lifetime 
with  decreasing  photon  energy.  At  the  same  time,  the  slope  of  the  lifetime-carrier  density  curves  decreases  with  decreasing 
photon  energy.  The  lifetime  is  almost  constant  at  120  femtoseconds  for  780  meV  excitation,  the  lowest  photon  energy 
reported here. The results are summarized in figures 5 and 6 below.

Fig. 5. Convolution fit results for the 0.5 micron thick sample. The incident pulsewidth was 153 fs, and the excitation
wavelength was 1.549 μm.

Fig. 6. Convolution fits for a 1 μm sample, with 137 fs pulsewidths incident. The upper curve corresponds to an excitation
wavelength of 1.539 μm, while the lower curve corresponds to a wavelength of 1.592 μm.

The 3 micron sample begins to violate the assumption of an optically thin sample (the Beer's law absorption depth in InGaAs
is 2.5 μm). Somewhat faster lifetimes are returned by the convolution fit, which may be attributable to the sample acting like
a saturable absorber 1 *. These lifetime fit results are in figure 7.
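
The optically thin condition of Eq. (1) can be checked directly for the three film thicknesses using the 2.5 μm absorption depth quoted above. The sketch below computes L/a and, assuming simple Beer's law attenuation exp(-L/a) with reflections ignored, the fraction of light transmitted through each film. The function names are illustrative.

```python
import math

ABSORPTION_DEPTH_UM = 2.5  # Beer's law absorption depth in InGaAs, from the text


def thickness_ratio(thickness_um, depth_um=ABSORPTION_DEPTH_UM):
    """Ratio L/a that must be much less than 1 for an optically thin sample."""
    return thickness_um / depth_um


def transmitted_fraction(thickness_um, depth_um=ABSORPTION_DEPTH_UM):
    """Beer's law transmission exp(-L/a), ignoring surface reflections."""
    return math.exp(-thickness_um / depth_um)


for thickness_um in (0.5, 1.0, 3.0):
    print(f"L = {thickness_um} um: L/a = {thickness_ratio(thickness_um):.2f}, "
          f"transmitted fraction = {transmitted_fraction(thickness_um):.2f}")
```

Only the 3 μm film has L/a above unity, which matches the observation that it is the sample where the thin-sample assumption starts to break down.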


Fig. 7. Convolution fits for a 3 μm thick sample. The closed circles are the results for a wavelength of 1.544 μm, and the
open diamonds are the results for a wavelength of 1.574 μm.


6.  Summary 

We have optically measured carrier lifetimes in In0.53Ga0.47As grown by MBE on InP. Measurements were made at several
photon energies just above the bandgap. Carrier densities ranging from 3 x 10^17 to 4 x 10^18 cm^-3 were created in the sample.
Lifetimes of about 150 femtoseconds were found, with carrier-carrier scattering appearing to increase with increasing carrier
densities.  Lifetimes  were  found  to  decrease  somewhat  with  decreasing  distance  of  the  photoexcitation  energy  from  the  band 
edge,  and  the  dependence  on  carrier  density  also  decreased. 

Abstract

Regeneratively-initiated,  self-sustained,  mode-locked  operation  of  a 
chromium-doped  forsterite  laser  operated  at  3.5  °C  is  described.  By 
employing  intracavity,  negative  group  velocity  dispersion  compensation, 
nearly  transform-limited  femtosecond  pulses  of  48  fsec  (FWHM)  duration 
were generated with average TEM00 output powers of 380 mW at 1.23 μm.
Regenerative-initiation  provides  improvement  in  the  output  stability  and 
ease  of  operation  compared  to  fixed  frequency  AO  modulators.  By  tuning  the 
mode-locked laser in the range 1.21-1.26 μm, estimated values for forsterite
dispersion  constants  have  also  been  obtained  for  the  first  time.  The 
demonstrated  power  and  stability  open  the  door  to  applications  such  as 
efficient  second  harmonic  generation. 

Among the recently developed novel techniques of ultrashort pulse
generation,  self-mode-locking  has  become  widely  used  and  applied  to  several 
tunable  solid-state  lasers  to  produce  femtosecond  pulses.  First  demonstrated 
in the Ti:sapphire laser by Spence et al. [1], this scheme has been shown to
work in other solid-state laser hosts including Nd:YLF [2], Cr3+:LiSrAlF6 [3],
chromium-doped forsterite (Cr:forsterite) [4], Nd:YAG [5], and Cr3+:LiCaAlF6
[6]. Soliton-type pulse shaping mechanisms, where intensity dependent Kerr
nonlinearities  in  the  gain  medium  producing  positively  chirped  pulses  are 
balanced  by  prism  pair  negative  group  velocity  dispersion,  give  rise  to  stable 
femtosecond  pulse  trains  in  these  lasers.  A  variety  of  initiation  techniques 
such  as  continuous-wave  (cw)  self-mode-locking  [1],  regenerative  initiation 
[7-9],  synchronous  pumping  [10],  and  acousto-optical  modulation  [11]  have 
been  used  to  set  the  initial  intensity  conditions  necessary  for  the  soliton-like 
pulse  shaping  to  take  place. 

The broad gain bandwidth of the Cr:forsterite laser makes it a suitable
candidate  for  the  generation  of  ultrashort  pulses.  To  date,  acousto-optically 
mode-locked [12], synchronously pumped [12], acousto-optically initiated self-mode-locked
[4], and additive-pulse mode-locked [13] modes of operation
have been demonstrated. Seas et al. [4] reported the shortest pulses to date of
60  fsec  (FWHM)  duration  using  acousto-optically  initiated  self  mode  locking 
with  intracavity  group  velocity  dispersion  (GVD)  compensation.  They 
reported  that  90  fsec  pulses  were  more  routinely  generated,  suggesting  to  us 
that  some  pulsewidth  instabilities  were  present.  They  reported  only  85  mW 
of  average  output  power. 

In this paper, we describe the performance of a regeneratively initiated,
self-sustainable,  mode-locked  Cr:forsterite  laser  operated  at  3.5  °C  that  is 
pumped  by  a  cw  Nd:YAG  laser.  Regenerative  mode-locking  eliminates  the 
need  for  synchronicity  between  the  acousto-optic  modulator  rf  drive  signal 
and  the  cavity  repetition  frequency.  In  our  experience  with  acousto-optic 
mode  locking  of  a  forsterite  laser,  maintaining  this  synchronicity  was 
extremely critical for useful output. When cavity length drift occurred, not
only did the pulsewidth increase, but large fluctuations in the average power
were  observed.  Regenerative  initiation  eliminated  these  problems. 
Regenerative  modulation  uses  a  portion  of  the  cavity  beat  signal  to  drive  the 
acousto-optic  modulator  electronics,  thus  obviating  the  need  for  stringent 
cavity  length  control.  It  also  allows  the  in  situ  measurement  of  cavity 
dispersion.  Once  pulse  shaping  is  initiated,  a  very  stable  train  of  femtosecond 
pulses  develops  due  to  the  balance  between  intensity-dependent  Kerr-induced 
nonlinearities and the intracavity dispersion of the cavity. As Seas et al. [4]
demonstrated, Cr:forsterite is capable of operating in this self-sustained mode
once the pulses are initiated.

Unique  to  our  work  is  the  improvement  in  operating  stability  provided  by 
regenerative  initiation,  the  generation  of  significantly  shorter  nearly 
transform-limited  pulses  (48  fsec  FWHM  duration),  and  a  significant  increase 
in average TEM00 output power (380 mW at 1.23 μm). These represent to our
knowledge  the  shortest  and  highest  peak  power  pulses  directly  generated 
from  this  laser  system.  Furthermore,  using  the  cavity  dispersion 
measurement  technique  developed  by  Knox  [14],  the  second  and  third  order 
dispersion  constants  in  the  lasing  range  of  the  forsterite  crystal  have  been 
measured for the first time. The combination of high power, reliable
operation, and cavity dispersion measurements opens the door to shorter
pulse  generation  and  applications  such  as  efficient  second  harmonic 
generation  of  femtosecond  pulses  in  the  615  nm  region. 

The  experimental  set-up  of  the  regeneratively  initiated  self-mode-locked 
Cr:forsterite laser is shown in figure 1 and is similar to the laser described in
reference 4 except for the cavity length, crystal length, output coupler, prism
separation, and method of acousto-optic initiation. The folded, astigmatically
compensated  laser  resonator  consisted  of  a  flat  wedged  high  reflector  (M3)  and 
a  3.5  %  transmitting  output  coupler  (O.C)  of  157  cm  radius  of  curvature  with 
the  gain  medium  positioned  slightly  off-center  between  a  pair  of  high 
reflecting curved mirrors (M1 and M2) each of 5 cm focal length and separated
by  10.8  cm.  The  laser  mirrors  were  obtained  from  the  optics  division  of 
Spectra  Physics  Lasers,  Inc.  and  were  broadband  coated  for  operation  between 
1.15 and 1.35 μm. A regeneratively driven acousto-optic modulator (A.O.M.)
was placed near the output coupler. A pair of prisms (P1 and P2) placed on
the  high  reflector  side  were  used  for  dispersion  compensation.  The  total 
cavity  length  was  185  cm  corresponding  to  a  longitudinal  mode  spacing  of 
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81.265  MHz.  A  cw  Nd:YAG  laser  (Quantronix  model  416)  operated  at  1.064 
pm  was  mode-matched  and  focussed  into  the  forsterite  crystal  using  an  anti¬ 
reflection  (AR)  coated,  bi-convex  lens  (LI)  of  10  cm  focal  length  through  Ml 
having  93.3%  transmission  at  1.064  pm.  A  half-wave  plate  (W.P.)  at  1.064  pm 
was  used  to  adjust  the  pump  polarization  to  obtain  optimum  power  output 
from  the  laser. 
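
As a rough cross-check of the cavity figures quoted above, the sketch below
evaluates the standing-wave mode spacing c/(2L). It assumes a simple linear
cavity and treats the quoted 185 cm as a nominal length; the function names
are illustrative and not part of the original work.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def mode_spacing_hz(optical_length_m):
    """Longitudinal mode spacing c / (2 L) of a standing-wave cavity."""
    return C / (2.0 * optical_length_m)

def optical_length_m(mode_spacing):
    """Optical cavity length implied by a measured mode spacing."""
    return C / (2.0 * mode_spacing)

nominal_spacing = mode_spacing_hz(1.85)     # from the quoted 185 cm
implied_length = optical_length_m(81.265e6)  # from the quoted 81.265 MHz

print(f"mode spacing for 185 cm: {nominal_spacing / 1e6:.2f} MHz")
print(f"length implied by 81.265 MHz: {implied_length * 100:.2f} cm")
```

The two quoted numbers agree to within about 0.3%, which is consistent with
185 cm being a rounded value for the optical path length.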

The gain medium, a 4 mm x 4 mm x 12 mm Brewster cut forsterite crystal
with 0.3% chromium concentration, was oriented with the crystal a-axis (Pnma
crystallographic notation) in the plane of incidence of a p-polarized electric
field. The crystal was obtained from IFC, Inc. It was wrapped in indium foil
and tightly clamped between copper plates to facilitate rapid heat exchange.
A thermoelectric cooler with a feedback loop maintained the crystal
temperature at 3.5 °C with peak temperature fluctuations of less than 0.2 °C.
Careful temperature control of the crystal was crucial in obtaining a
stable train of femtosecond pulses. Temperature fluctuations of a few degrees
gave rise to as much as 50% power fluctuations over 100 psec time scales
when the laser was being pumped well above threshold. A plexiglass
enclosure surrounding the crystal holder assembly was purged with dry
nitrogen gas to minimize water condensation on the crystal surfaces. The
gain medium had 70.9% absorption at 1.064 µm at the operating temperature
of 3.5 °C.

With a 3.5% transmitting output coupler, 6.5 W of pump absorbed, and a
crystal temperature of 3.5 °C, the output power of the laser running in cw
mode (no prisms, no A.O.M.) was 420 mW. The output wavelength of the
laser was centered at 1.23 µm. The absorbed pump power slope efficiency at
low pump power levels was measured to be 10.4%, the threshold pump
power being 1.6 W. For absorbed pump powers beyond 5 W, the slope
efficiency started to level off due to increased thermal loading of the
forsterite crystal. Alignment of the focussing mirrors was critical to quiet
operation. Beyond pump power levels of 5 W, the cw output power sometimes
displayed chaotic power fluctuations. We believe this was due to thermal
lensing induced by the pump beam. The fluctuations could be fully overcome by
carefully translating the mirror M1.
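
The roll-off in slope efficiency can be illustrated by comparing the measured
cw output with a straight-line extrapolation of the low-power slope. This is
a minimal sketch that assumes the 1.6 W threshold refers to absorbed pump
power; the function name is illustrative.

```python
def linear_output_w(absorbed_w, threshold_w=1.6, slope=0.104):
    """Output power predicted by a constant slope efficiency above threshold."""
    return max(0.0, slope * (absorbed_w - threshold_w))

predicted_w = linear_output_w(6.5)   # extrapolated from the low-power slope
measured_w = 0.420                   # cw output reported at 6.5 W absorbed
shortfall = 1.0 - measured_w / predicted_w

print(f"linear extrapolation: {predicted_w * 1e3:.0f} mW")
print(f"measured output is {shortfall * 100:.0f}% below the extrapolation")
```

Under these assumptions the measured 420 mW sits below the linear
extrapolation, in line with the reported levelling off beyond 5 W of
absorbed pump power.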

The laser was first mode-locked without employing intracavity dispersion
compensation. The regenerative mode-locking scheme is similar to that
described in reference 8. The cavity loss was modulated using a regenerative
acousto-optic mode-locker which had 0.4% modulation depth and a 0.5 W RF
amplifier. The acousto-optic modulator (A.O.M.) (NEOS Technologies, Inc.
model N12040-2-LIT-BR-IN) used a 1 cm long Brewster angled quartz crystal
operated off resonance. Approximately 4% of the laser output power was sent
to an InGaAs photodiode to produce a signal for the regenerative mode-locker
electronics. Inclusion of the A.O. modulator caused approximately 6%
reduction in the total cw output power of the laser. A portion of the signal
from the InGaAs detector was also sent to a Hewlett Packard model 5328A 500
MHz universal frequency counter to precisely register the pulse repetition
rate. The mode-locked output of the laser was analyzed using a scanning
spectrometer (Monolight model 8000) with approximately 2.5 nm wavelength
resolution and an autocorrelator with a 2 mm thick LiIO3 doubling crystal.
The spectrum and autocorrelation signals were acquired using a Tektronix
model 2230 500 MHz digital storage oscilloscope and recorded by an interfaced
computer.

We observed three distinct modes of operation. Using no intracavity
dispersion compensation, and for cw output powers below 280 mW
corresponding to 4.3 W of absorbed pump power, 41 psec FWHM pulses
(assuming a Gaussian pulse shape) were obtained from the Cr:forsterite laser.
The pulse width measured is in agreement with what was previously
reported by Alfano's group [12], and is very close to that predicted from
active mode-locking theory for chirp-free pulses [15], which was calculated
to be 44 psec.

Increasing the absorbed pump power beyond 4.3 W, which increased the
output power of the laser, resulted in pulses of 6.5 psec (FWHM) duration.
Again a Gaussian pulse shape was assumed. As much as 380 mW cw TEM00
output power at 1.23 µm was obtained while the laser maintained this output
pulse width. Due to the limited resolution of the scanning spectrometer, the
bandwidth of the mode-locked pulses could not be fully resolved. We believe
the shorter pulses at higher absorbed pump power are evidence of intracavity
intensity induced nonlinear effects (i.e., self-phase-modulation) in the gain
medium. Self-phase-modulation gives rise to increased bandwidth of the
pulses, which can support the shorter pulse widths. Because no intracavity
dispersion compensation was employed, we believe that these 6.5 psec
pulses had excess frequency chirp and hence were not transform-limited, as
observed in [4].

To compensate for the positive second order dispersion in the cavity, a
pair of SF-14 Brewster angled prisms (P1 and P2) was placed on the high
reflector (M1) side of the cavity. The prism separation was 48 cm, slightly
longer than that reported by Seas et al. [4], which is due to the longer
Cr:forsterite crystal used in this work. Prior to observing femtosecond pulse
generation, the laser resonator was first aligned at a low pump power level to
obtain optimum cw output power. Subsequently, the pump power was
increased beyond the threshold level for self-phase-modulation (4.3 W) with
the regenerative mode-locker operating to initiate the femtosecond pulse
train. Once initiated, the laser produced a very stable uninterrupted train of
femtosecond pulses. No apertures or other means of starting, such as tapping
on the table, were necessary. The TEM00 output power of the laser was
380 mW with the spectrum centered at 1.23 µm. Figures 2 and 3 show the
noncollinear intensity autocorrelation and the spectrum of the
femtosecond pulses, respectively. Assuming a sech$^2$ intensity profile [16],
the pulsewidth (FWHM) was measured to be 48 fsec. The overall dispersive
broadening due to the output coupler and the autocorrelator optics was
estimated to be less than 2 fsec for this 48 fsec pulse at 1.23 µm. A
simultaneous measurement of 33.7 nm bandwidth gave a measured
time-bandwidth product of 0.321, indicating that the pulses were nearly
transform-limited and free of excess frequency chirp. We believe that higher
intracavity power levels (28% higher, 2.77 MW) were the predominant factor in
obtaining pulses shorter than what was previously reported [4]. With the
regenerative mode-locker off, self-sustained operation for up to 2 minutes
was observed. Cessation of the mode-locked operation was believed to be due
to micromechanical perturbations of the system. The cavity repetition rate was
stable to better than 40 Hz and could be varied by changing the cavity length
in the range 81.2300 to 81.3200 MHz without interrupting the mode-locking
process. The peak output power per pulse was determined to be 97 kW.
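
The time-bandwidth product and peak power quoted above can be reproduced from
the reported pulse parameters. The sketch below assumes the narrow-band
conversion Δν = cΔλ/λ² and estimates peak power as pulse energy divided by
the FWHM duration, with no pulse-shape factor; that choice appears to match
the reported 97 kW but is an assumption here.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def bandwidth_hz(center_m, fwhm_m):
    """Convert a wavelength FWHM to a frequency FWHM (narrow-band limit)."""
    return C * fwhm_m / center_m ** 2

pulse_fwhm_s = 48e-15
rep_rate_hz = 81.265e6
average_power_w = 0.380
output_coupling = 0.035

tbp = pulse_fwhm_s * bandwidth_hz(1.23e-6, 33.7e-9)
peak_power_w = average_power_w / (rep_rate_hz * pulse_fwhm_s)
intracavity_peak_w = peak_power_w / output_coupling

print(f"time-bandwidth product: {tbp:.3f}")
print(f"peak output power: {peak_power_w / 1e3:.0f} kW")
print(f"intracavity peak power: {intracavity_peak_w / 1e6:.2f} MW")
```

For comparison, the transform limit for a sech$^2$ pulse is a time-bandwidth
product of about 0.315.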

The mode-locked laser was tuned in the range 1.211-1.264 µm by
translating a slit between the prism P2 and high reflector M3. Using the
frequency counter, the pulse repetition rate was measured as a function of
wavelength. By employing the cavity dispersion calculation technique
developed by Knox [14], and by accounting for the known dispersion of the
AO cell and the prism pair, the second and third order dispersion constants of
forsterite at 1.23 µm were determined to be
$d^2n/d\lambda^2 = 0.047\ \mu\mathrm{m}^{-2}$ and
$d^3n/d\lambda^3 = -0.339\ \mu\mathrm{m}^{-3}$, respectively. The error in
these measurements was estimated to be 10%. Using these numbers, the
calculated third order phase distortion $d^3\Phi/d\omega^3$ for one cavity
round trip was found to be positive (approximately 11,000 fsec$^3$) and not
compensated. We have estimated [17] that the pulses have 10 fsec of cubic
phase distortion and that, with cubic dispersion minimization techniques [18],
reduction of pulse widths by at least 20% is achievable.
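
To give a feel for the magnitudes involved, the sketch below converts the
measured index derivatives into second and third order phase terms for the
forsterite crystal alone, using the standard material-dispersion relations
$d^2\Phi/d\omega^2 = \lambda^3 L\,(d^2n/d\lambda^2)/(2\pi c^2)$ and
$d^3\Phi/d\omega^3 = -\lambda^4 L\,(3\,d^2n/d\lambda^2 + \lambda\,d^3n/d\lambda^3)/(4\pi^2 c^3)$.
It assumes two passes through the 12 mm crystal per round trip and omits the
prism pair, the AO cell, and the mirrors, so it is not expected to reproduce
the full-cavity value quoted above.

```python
import math

C_UM_PER_FS = 0.299792458  # speed of light in micrometres per femtosecond

def gdd_fs2(wavelength_um, d2n, length_um):
    """Second order phase (fs^2) from material dispersion over a path length."""
    return wavelength_um ** 3 * d2n * length_um / (2 * math.pi * C_UM_PER_FS ** 2)

def tod_fs3(wavelength_um, d2n, d3n, length_um):
    """Third order phase (fs^3) from material dispersion over a path length."""
    bracket = 3 * d2n + wavelength_um * d3n
    return -(wavelength_um ** 4) * bracket * length_um / (
        4 * math.pi ** 2 * C_UM_PER_FS ** 3
    )

round_trip_um = 2 * 12_000.0  # two passes through the 12 mm crystal (assumed)
crystal_gdd = gdd_fs2(1.23, 0.047, round_trip_um)
crystal_tod = tod_fs3(1.23, 0.047, -0.339, round_trip_um)

print(f"crystal-only second order phase: {crystal_gdd:.0f} fs^2")
print(f"crystal-only third order phase:  {crystal_tod:.0f} fs^3")
```

The crystal-only third order term comes out positive and of the same order of
magnitude as the round-trip figure reported in the text.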

In conclusion, we have demonstrated a regeneratively initiated
self-mode-locked Cr:forsterite laser operated at 3.5 °C and pumped by a cw
Nd:YAG laser at 1.064 µm. We have identified three regimes of operation for
this laser. Without compensating for the cavity dispersion, 41 psec and 6.5
psec (FWHM) pulses with average TEM00 output powers of 280 and 380 mW,
respectively, were produced at 1.23 µm. These modes of operation correspond
to active mode-locking regimes, chirp-free and chirped, respectively. By
employing intracavity GVD compensation, a very stable train of 48 fsec
(FWHM) pulses with average output power of 380 mW was generated. This regime
is similar to the now common self-mode-locked regime where soliton-like
pulse shaping is important. Up to 2 minutes of self-sustained operation was
observed. By tuning the mode-locked laser, second and third order crystal
dispersion constants have also been measured for the first time. These
represent, to our knowledge, the shortest, highest peak power light pulses
directly generated from this laser system. These peak powers and operational
stability open the door to applications such as second harmonic generation
and optical tomography of biological tissues.



Sennaroglu,  Pollock,  and  Nathel:  Regen.  cw  mode-locked  Cr:forsterite  laser 
Acknowledgments: 


We  would  like  to  thank  Timothy  J.  Carrig  for  helping  with  the  experimental 
set-up  and  David  Cohen  with  the  data  acquisition  system.  Thanks  are  also 
extended  to  Spectra  Physics  Lasers,  Inc.  for  technical  assistance.  This  work  was 
supported by the National Science Foundation under grant ECS-9111838, the
Joint  Services  Electronics  Program,  the  Materials  Science  Center  at  Cornell 
University,  and  by  the  U.S.  Department  of  Energy  under  the  auspices  of 
contract  W-7405-Eng-48. 


References: 

1.  D.  E.  Spence,  P.  N.  Kean,  and  W.  Sibbett,  Opt.  Lett.  16, 42  (1991). 

2.  G.  P.  A.  Malcolm  and  A.  I.  Ferguson,  Opt.  Lett.  16, 1967  (1991). 

3.  A.  Miller,  P.  LiKamWa,  B.  H.  T.  Chai,  and  E.  W.  Van  Stryland,  Opt.  Lett.  17, 
195  (1992). 

4.  A.  Seas,  V.  Petricevic,  and  R.  R.  Alfano,  Opt.  Lett.  17,  937  (1992). 

5.  K.  X.  Liu,  C.  J.  Flood,  D.  R.  Walker,  and  H.  M.  van  Driel,  Opt.  Lett.  17,  1361 
(1992). 

6.  P.  LiKamWa,  B.  H.  T.  Chai,  and  A.  Miller,  Opt.  Lett.  17, 1438  (1992). 

7.  J.  D.  Kafka,  M.  L.  Watts,  and  T.  Baer,  in  Digest  of  Conference  on  Lasers  and 
Electro-optics  (Optical  Society  of  America,  Washington  D.C.,  1991),  paper 
JMB3;  J.  D.  Kafka,  M.  L.  Watts,  and  T.  Baer,  in  Digest  of  Optical  Society  of 
America  Annual  Meeting,  (Optical  Society  of  America,  Washington  D.  C., 
1991),  paper  TuI2. 

8.  J.  D.  Kafka,  M.  L.  Watts,  and  J.  J.  Pieterse,  IEEE  J.  Quantum  Electron.  28,  2151 
(1992). 

9.  D.  E.  Spence,  J.  M.  Evans,  W.  E.  Sleat,  and  W.  Sibbett,  Opt.  Lett.  16,  1762 
(1991). 

10. F. Krausz, Ch. Spielmann, T. Brabec, E. Wintner, and A. J. Schmidt, Opt.
Lett. 17, 204 (1992).

11.  P.  F.  Curley  and  A.  I.  Ferguson,  Opt.  Lett.  16, 1016  (1991). 

12.  A.  Seas,  V.  Petricevic,  and  R.  R.  Alfano,  Opt.  Lett.  16, 1668  (1991). 

13.  A.  Sennaroglu,  T.  J.  Carrig,  and  C.  R.  Pollock,  Opt.  Lett.  17, 1216  (1992). 

14.  W.  H.  Knox,  Opt.  Lett.  17,  514  (1992). 



15. A. E. Siegman and D. J. Kuizenga, Opto-Electronics 6, 43 (1974).

16. A sech$^2$ pulse shape is expected when soliton-like pulse shaping is
important.

17. R. L. Fork, C. H. B. Cruz, P. C. Becker, and C. V. Shank, Opt. Lett. 12,
483 (1987).

18. C. P. Huang, M. T. Asaki, S. Backus, M. M. Murnane, H. C. Kapteyn, and H.
Nathel, Opt. Lett. 17, 1289 (1992).



Figure  Captions: 

Figure 1: The schematic of the regeneratively initiated cw mode-locked
Cr:forsterite laser.

Figure 2: The noncollinear intensity autocorrelation of the regeneratively
initiated cw mode-locked Cr:forsterite femtosecond pulses after
dispersion compensation. The pulsewidth (FWHM) is 48 fsec.

Figure 3: The spectrum of the regeneratively initiated cw mode-locked
Cr:forsterite femtosecond pulses after dispersion compensation. The
spectral width (FWHM) is 33.7 nm.
[Figure annotation: Mode-locked Cr:forsterite oscillator; λ = 1.23 µm;
absorbed pump power = 6.0 W; output power = 380 mW; peak output power per
pulse = 97 kW; self-sustaining.]
Abstract

We report on the external second harmonic generation of a regeneratively
initiated self-mode-locked Cr:forsterite laser in a LiIO3 nonlinear crystal.
Using 48 fsec pulses with average power of 246 mW at 1.23 µm, 75 fsec pulses
with average power of 24 mW at 615 nm were obtained, giving conversion
efficiencies approaching 10%. The time-bandwidth product of the red pulses
was measured to be 0.77. The second harmonic pulses were tunable from 605
nm to 635 nm, extending the operational wavelength range of the
Cr:forsterite laser into the visible portion of the spectrum.
Generation of tunable femtosecond pulses in the red by frequency
doubling a mode-locked Cr:forsterite laser

Alphan  Sennaroglu  and  Clifford  R.  Pollock 

School  of  Electrical  Engineering,  Cornell  University,  Ithaca  NY  14853 
Tel:(607)-255-5032;  Fax:(607)-254-4565 
External second harmonic generation (SHG) offers a simple scheme for
extending the operational wavelength range of a tunable laser. Recently,
there  has  been  an  unprecedented  growth  in  the  development  of  novel  mode 
locking  techniques  using  tunable  solid-state  lasers.  Broadly  tunable,  high 
peak  power  subpicosecond  pulses  have  been  demonstrated  over  a  large 
portion  of  the  near  IR  region.  Because  such  high  peak  powers  are  essential  to 
achieving  high  conversion  efficiencies  in  nonlinear  processes  such  as  SHG, 
these  mode-locked  tunable  solid-state  lasers  open  the  way  to  efficient 
generation  of  tunable  second  harmonic  pulses. 

In this Letter, we describe the external doubling of a regeneratively
initiated, self-mode-locked Cr:forsterite laser using a LiIO3 nonlinear
crystal. Using 48 fsec (FWHM) input pulses at 1.23 µm with average output
power of 246 mW, 75 fsec (FWHM) pulses at 615 nm with conversion efficiency
of 10% were obtained. By tuning the output of the pump laser from 1.21 to
1.27 µm, the second harmonic output wavelength could be tuned in the range
605 to 635 nm.
The regeneratively initiated, self-mode-locked Cr:forsterite laser used in
the SHG experiment has been described elsewhere [1]. Briefly, it consists of a
folded, astigmatically compensated z-cavity with a 3.5% transmitting output
coupler. The gain medium is a 12 mm long Brewster-cut Cr:forsterite crystal
having 0.3% chromium concentration. The laser is collinearly pumped by a
continuous-wave Nd:YAG laser operated at 1.06 µm. When maintained at an
operating temperature of 3.5 °C through active cooling, the Cr:forsterite
crystal absorbs 70.9% of the incident 1.06 µm pump power. The absorbed
pump power slope efficiency of the Cr:forsterite laser is 10.4%, the threshold
pump power being 1.6 W. With the intracavity positive group velocity
dispersion (GVD) compensated by a pair of Brewster-cut SF-14 prisms
separated by 48 cm, self mode locking is initiated with a regeneratively
driven acousto-optic mode-locker operated off-resonance. The mode-locked
Cr:forsterite laser, operating at an 81.27 MHz pulse repetition rate, is
capable of delivering average powers as high as 380 mW. The output pulsewidth
(FWHM) is 48 fsec at 1.23 µm. By translating a slit between the second
prism of the GVD compensation pair and the cavity high reflector, the output
wavelength of this laser can be tuned in the wavelength region from 1.21 to
1.27 µm.

The SHG set-up used for externally doubling the mode-locked Cr:forsterite
laser is shown in figure 1. As the nonlinear medium, a 2 mm thick LiIO3
crystal (Cleveland Crystals, Inc.), type-I phase-matched at 1.23 µm, was used.
In order to prevent degradation of the surface quality of this hygroscopic
crystal, a cover slip (of thickness 0.2 mm) with anti-reflection (AR) coating
on one side was glued to each crystal surface using UV curing epoxy. The
nonlinear crystal was mounted on a rotation-tilt stage to accurately optimize
the second harmonic efficiency while tuning the pump laser. The incident
Cr:forsterite laser beam was focussed to a 25 µm diameter spot inside the
LiIO3 crystal using a telescope arrangement of two 5 cm focal length AR coated
plano-convex lenses (L1 and L2) separated by 1.5 cm. The emerging beam was
recollimated with a broadband AR coated (450-700 nm) 5 cm focal length
lens (L3). After separating the second harmonic signal from the fundamental
with a dichroic filter (F1) having 99.9% reflectivity at 1.23 µm and 95%
transmission in the red, the SHG power was measured with a Molectron
model 5100 power meter. Temporal characteristics of the red pulses were
studied by measuring the collinear intensity autocorrelation with a 0.6 mm
thick BaB2O4 (BBO) crystal aligned for type I phase matching. The spectral
width of the SHG pulses was measured with a 0.25 m monochromator and a
silicon detector.

After careful alignment of the LiIO3 crystal, 24 mW of average power at 615
nm was obtained with 246 mW of incident power at 1.23 µm, resulting in
9.7% conversion efficiency. Figure 2 shows the collinear intensity
autocorrelation of the SHG pulses at 615 nm. Assuming a sech$^2$ intensity
profile, the pulsewidth (FWHM) was measured to be 75 fsec. A simultaneous
measurement of 13 nm spectral bandwidth gave a time-bandwidth product of
0.77. The red pulses could be tuned from 605 to 635 nm with the pulsewidth
essentially remaining the same.
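
The conversion efficiency and time-bandwidth product reported for the red
pulses follow directly from the measured quantities. This sketch assumes the
narrow-band conversion Δν = cΔλ/λ²; the variable names are illustrative.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

shg_power_w = 24e-3
fundamental_power_w = 246e-3
conversion_efficiency = shg_power_w / fundamental_power_w

shg_wavelength_m = 615e-9
shg_bandwidth_m = 13e-9
shg_pulse_s = 75e-15
shg_bandwidth_hz = C * shg_bandwidth_m / shg_wavelength_m ** 2
shg_tbp = shg_pulse_s * shg_bandwidth_hz

print(f"conversion efficiency: {conversion_efficiency * 100:.1f}%")
print(f"time-bandwidth product: {shg_tbp:.2f}")
```

The product is well above the sech$^2$ transform limit of about 0.315, which
is consistent with the broadening discussed later in the text.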

The expected efficiency of second harmonic generation from LiIO3 was
estimated by taking into account the walk-off effects between the fundamental
and the second harmonic beams, the finite divergence of the fundamental
beam, and the finite spectral phase matching bandwidth of the crystal.
Following the treatment of Boyd and Kleinman [2], the amount of second
harmonic power $P_{2\omega}$ (in watts) generated from a monochromatic beam
in a nonlinear medium, with the assumption of no absorption and no pump
depletion, can be estimated using the equation

$$P_{2\omega} = \frac{16\pi^2 \eta_0 d^2 L\, h_m(B,\xi)}{n^3 \lambda^3}\, P^2(\lambda)\, F(\lambda) \qquad (1)$$

In (1), where all the quantities are expressed in MKS units, $\eta_0$ is the
vacuum impedance, d is the effective nonlinear coefficient of the medium, L
is the crystal length, n is the crystal index of refraction, and
$P(\lambda)$ is the fundamental spectral power distribution. The
dimensionless factor $h_m(B,\xi)$, which is a function of the normalized
walk-off parameter B and the normalized focussing parameter $\xi$, accounts
for the efficiency limitations due to walk-off effects arising from double
refraction and the finite beam divergence (see reference 2 for definitions of
B and $\xi$). The fundamental beam in this experiment is no longer
monochromatic for 48 fsec pulses, so the effect of the finite spectral
phase-matching bandwidth of the LiIO3 crystal has to be taken into account
through the efficiency factor $F(\lambda)$ appearing in (1), defined
according to

$$F(\lambda) = \mathrm{sinc}^2\!\left[\frac{\Delta k(\lambda)\, L}{2}\right] \qquad (2)$$

In (2), $\Delta k(\lambda)$ is the wave vector mismatch between the
fundamental and the second harmonic waves. By expressing $P^2(\lambda)$ as

$$P^2(\lambda) = P_0^2\, p_n(\lambda) \qquad (3)$$

where $p_n(\lambda)$ is the normalized spectral distribution function of the
squared incident power, the effect of the finite spectral phase-matching
bandwidth of the SHG crystal on the conversion efficiency can be estimated by
integrating (1) over all wavelengths. This simply replaces the function
$P^2(\lambda)F(\lambda)/\lambda^3$ appearing in (1) by the spectrally
averaged value of $P^2(\lambda)/\lambda^3$ using $F(\lambda)$ as the
weighting factor.

Calculation of various quantities appearing in (1) was done using the
Sellmeier equations for LiIO3 given in reference 3. Using the fact that LiIO3
is a negative uniaxial crystal, the type I phase matching angle $\theta_m$
and the walk-off angle $\rho$ are calculated to be 25.94° and 3.67°,
respectively. Furthermore, since LiIO3 belongs to the point group 6, the
effective nonlinear coefficient d given by $d_{31}\sin(\theta_m+\rho)$ is
calculated to be 2 pm/V using $d_{31}$ = 4.1 pm/V [4]. For a 2 mm thick
crystal with index of refraction n = 1.85218 and focussed beam diameter of
25 µm, the parameters B and $\xi$ discussed earlier evaluate to 4.4 and 1.35,
giving $h_m(B,\xi) \approx 0.17$ [2].
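
The effective nonlinear coefficient quoted above is a one-line evaluation of
$d_{31}\sin(\theta_m+\rho)$; the short sketch below simply carries out that
arithmetic with the angles and $d_{31}$ given in the text.

```python
import math

d31_pm_per_v = 4.1
phase_match_angle_deg = 25.94
walkoff_angle_deg = 3.67

d_eff_pm_per_v = d31_pm_per_v * math.sin(
    math.radians(phase_match_angle_deg + walkoff_angle_deg)
)
print(f"effective nonlinear coefficient: {d_eff_pm_per_v:.2f} pm/V")
```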

By using the Sellmeier equations and a fixed phase-matching angle of
25.94°, $F(\lambda)$ defined in (2) is plotted in figure 3 for a 2 mm long
LiIO3 crystal. Also plotted in figure 3 is the function $p(\lambda)$ (not
normalized) for a sech$^2$ pulse of duration 48 fsec (FWHM). By averaging
$p_n(\lambda)$ using $F(\lambda)$ as the weighting factor, we estimated that
the finite spectral phase-matching bandwidth of the crystal would cause
approximately 75% reduction in the SHG conversion efficiency. With this
consideration in mind and by substituting all the relevant parameters
calculated above into (1), we arrived at an expected conversion efficiency of
approximately 11% for 63 kW peak power pulses. This is in excellent agreement
with the experimentally obtained value of 10%.
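
The 11% estimate can be reproduced numerically. The sketch below reads (1) as
$P_{2\omega}/P = 16\pi^2\eta_0 d^2 L\,h_m P/(n^3\lambda^3)$ and, as a
simplifying assumption, applies the 75% spectral reduction as a single
multiplicative factor of 0.25 rather than performing the weighted average.
The peak power is estimated as pulse energy divided by the FWHM duration.

```python
import math

ETA_0 = 376.730313668  # vacuum impedance, ohms

def shg_efficiency(peak_power_w, d_m_per_v, length_m, h_m, index, wavelength_m,
                   spectral_factor):
    """Ratio P_2w / P from Eq. (1), with F folded into one spectral factor."""
    prefactor = 16 * math.pi ** 2 * ETA_0 * d_m_per_v ** 2 * length_m * h_m
    return (prefactor * peak_power_w * spectral_factor
            / (index ** 3 * wavelength_m ** 3))

# Peak power of the 48 fsec, 246 mW, 81.27 MHz pulse train (no shape factor).
peak_power_w = 0.246 / (81.27e6 * 48e-15)

expected = shg_efficiency(
    peak_power_w=peak_power_w,
    d_m_per_v=2e-12,        # effective nonlinear coefficient, 2 pm/V
    length_m=2e-3,          # crystal thickness
    h_m=0.17,               # Boyd-Kleinman factor quoted in the text
    index=1.85218,
    wavelength_m=1.23e-6,
    spectral_factor=0.25,   # 75% reduction from finite phase-matching bandwidth
)

print(f"peak power: {peak_power_w / 1e3:.0f} kW")
print(f"expected conversion efficiency: {expected * 100:.0f}%")
```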

The finite phase matching bandwidth of the crystal is also expected to
affect the temporal and spectral characteristics of the second harmonic
pulses. One would ideally expect the second harmonic pulsewidth to be 0.707
times that of the fundamental pulses. However, as seen in figure 3, the
limited phase matching bandwidth of the 2 mm thick LiIO3 crystal will reduce
the bandwidth available for doubling by at least 50%, resulting in second
harmonic pulses of about 70 fsec. In addition, GVD of LiIO3
($d^2n/d\lambda^2 = 0.6866\ \mu\mathrm{m}^{-2}$ at 615 nm) together with the
GVD of approximately 2.2 cm of fused silica glass between the SHG crystal and
the autocorrelator is expected to further broaden these pulses to
approximately 85 fsec. This is in good agreement with the 75 fsec (FWHM)
pulses measured in our experiment. The measured time-bandwidth product of
0.77 also indicates that broadening and possible spectral distortion were
experienced by the SHG pulses. Our estimates of the expected pulsewidth from
the SHG process are only approximate. More accurate numerical analysis would
be required to fully study the combined effects of the finite phase-matching
bandwidth of the nonlinear crystal and dispersive effects on the pulsewidth
and time-bandwidth product.
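
The 70 fsec figure can be recovered with simple scaling. The sketch assumes
that the pulse duration scales inversely with the bandwidth available for
doubling, which holds only for transform-limited pulses of fixed shape.

```python
import math

fundamental_fs = 48.0
ideal_sh_fs = fundamental_fs / math.sqrt(2.0)   # ideal 0.707 scaling
bandwidth_fraction = 0.5                        # at most half the bandwidth is usable
limited_sh_fs = ideal_sh_fs / bandwidth_fraction

print(f"ideal second harmonic pulsewidth: {ideal_sh_fs:.1f} fsec")
print(f"bandwidth-limited pulsewidth:     {limited_sh_fs:.1f} fsec")
```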

In conclusion, we have demonstrated efficient external doubling of 48 fsec
pulses from a mode-locked Cr:forsterite laser using a LiIO3 nonlinear crystal.
With 246 mW of incident power at 1.23 µm, 75 fsec (FWHM) pulses with
conversion efficiency of 10% were obtained at 615 nm. The experimentally
measured SHG conversion efficiency agreed well with the expected value,
which was calculated by taking into account the beam walk-off effects, the
finite divergence of the fundamental beam, and the limited spectral
phase-matching bandwidth of the LiIO3 crystal. The red pulses, which were
tunable in the wavelength region from 605 nm to 635 nm, now extend the
operational wavelength range of the Cr:forsterite laser into the visible
portion of the spectrum. With the available high peak powers from this laser
system, it should be possible to use more sophisticated nonlinear parametric
amplification schemes to obtain broader wavelength tunability.
Figure Captions:

Figure 1: The experimental set-up of the externally doubled Cr:forsterite laser.

Figure 2: The collinear autocorrelation of the SHG pulses at 615 nm. The
pulsewidth (FWHM) is 75 fsec.

Figure 3: The plot of $F(\lambda)$ and $p(\lambda)$ as a function of
wavelength (µm) for a 2 mm thick LiIO3 crystal phase-matched at 1.23 µm.
Heterojunction Vertical FET's Revisited: Potential for
225-GHz Large-Current Operation

Steven R. Weinzierl and J. Peter Krusius, Senior Member, IEEE


Abstract: High-speed operation of submicrometer Al$_x$Ga$_{1-x}$As/GaAs
unipolar heterojunction transistors is examined using two-dimensional
time-dependent self-consistent ensemble Monte Carlo simulation. Careful
device design can significantly increase ballistic injection over the
heterojunction in steady state by eliminating retarding gate-induced
space-charge reversal there. Design for optimal large-signal transient
operation must also avoid gate-voltage-dependent ballistic injection.
General design principles for optimizing high-speed operation are proposed.
The resulting VFET's show cutoff frequencies of 225 GHz at large drain
currents at 300 K, with frequency-independent two-port y parameters.


Fig. 1. Cross section of the heterojunction VFET (device II in Table I).
Insets show the Al$_x$Ga$_{1-x}$As grading profile and the gate pulse for
transient analysis.


I. Introduction

BANDGAP engineered unipolar heterojunction transistors have long held great
promise for ultra-high-speed operation [1]. Although today's lateral
heterostructure devices are well developed, unipolar heterostructure
devices with transport across the heterolayers (vertical FET, VFET) have not
lived up to their expected performance. Early preliminary Monte Carlo
simulations predicted idealized intrinsic transconductances of 1250
mS/mm and unity gain cutoff frequencies of 250 GHz at 77 K [2], while
fabricated devices have never surpassed transconductances of 100 mS/mm
[3], [4]. Three reasons have motivated this study of heterostructure VFET
devices: 1) to explain the wide performance gap between predicted and
measured characteristics of VFET-type devices, 2) to establish guidelines
for the optimum VFET device designs, and 3) to study carrier launching across
a heterojunction (HJ) in the presence of lateral space charges for the first
time using a realistic nonequilibrium carrier transport formulation. While
specifically focusing on the HJ-VFET, the principles found in this study are
applicable to a number of other devices, including the vertical MESFET, the
permeable base transistor (PBT), and VFET's with a planar doped barrier
launcher.

II.  Simulation  Method 

A two-dimensional self-consistent time-dependent ensemble Monte Carlo
particle formulation is used here to explore the nonequilibrium
transport processes described above. This method is a straightforward
extension [5] of our equivalent one-dimensional formulation [6]. The
full knowledge of the microscopic processes provided by the Monte Carlo
method allows computation of all figures of merit for the intrinsic
device; no extrinsic device parasitics are considered here. Both
transconductance ($g_m$) and gate capacitance ($C_G$) are determined
from their definitions using simulated steady-state terminal current or
integrated charge data. The unity-gain cutoff frequency ($f_T$) is then
computed via $f_T = g_m/(2\pi C_G)$. The complex frequency-dependent
small-signal $y$ parameters are determined directly from the Monte
Carlo result via the Fourier decomposition method [7], [8], i.e.,
$y_{ij}(\omega) = F[\Delta I_i(t)]/F[\Delta V_j(t)]$, where $F$ denotes
the Fourier transform and $\Delta I_i(t)$ the current change at port
$i$ in response to the voltage change $\Delta V_j(t)$ at port $j$.
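
The two post-processing steps described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 80-fs sampling interval, and the synthetic test signal (a raised-cosine voltage pulse driving a conductance in parallel with a capacitance) are hypothetical and are not taken from the simulator used in this work. A pulse that returns to zero is used so that the ratio of discrete Fourier transforms is well defined over a finite time window.

```python
import cmath
import math


def cutoff_frequency(g_m, c_g):
    """Unity-gain cutoff frequency f_T = g_m / (2*pi*C_G).

    g_m in S/mm and c_g in F/mm give f_T in Hz.
    """
    return g_m / (2.0 * math.pi * c_g)


def dft_bin(samples, k):
    """Discrete Fourier transform of `samples` evaluated at bin k."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * m / n)
               for m, x in enumerate(samples))


def y_parameter(delta_i, delta_v, dt, k):
    """Estimate y_ij at DFT bin k as F[delta_I_i] / F[delta_V_j].

    Returns (frequency in Hz, complex admittance).
    """
    freq = k / (len(delta_v) * dt)
    return freq, dft_bin(delta_i, k) / dft_bin(delta_v, k)


def synthetic_port_response(g, c, dt, n=256, pulse_len=64, amplitude=0.3):
    """Hypothetical test data (not simulator output): a raised-cosine
    voltage pulse applied to a conductance g in parallel with a
    capacitance c."""
    dv = [amplitude * 0.5 * (1.0 - math.cos(2.0 * math.pi * m / pulse_len))
          if m < pulse_len else 0.0
          for m in range(n)]
    # i(t) = g*v + c*dv/dt, using a backward difference for the derivative
    di = [g * dv[m] + c * (dv[m] - dv[m - 1]) / dt for m in range(n)]
    return di, dv
```

For the synthetic data, the real part of the recovered admittance approximates the conductance and the imaginary part divided by $\omega$ approximates the capacitance, which is the same reading applied to $y_{21}$ (transconductance) and $y_{11}$ (gate capacitance) in the text.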

III.  Definition  of  Device  Structures 

All VFET devices examined have the same structure derived from
fabricated devices, in which current flows in parallel fingers from the
top electrode (source) down through the channel into the bottom
electrode (drain). The channel current is modulated by lateral gate
electrodes placed symmetrically on both sides. Only one of these
fingers needs to be simulated and its cross section is shown in Fig. 1.
Source and drain contacts are assumed ohmic, while gates are Schottky
contacts. The heterostructure launcher is embedded into the source and
has a graded Al$_x$Ga$_{1-x}$As ramp and an abrupt heterojunction
toward the channel. Table I shows the parameter sets for three
different devices: a fabricated device [4], a baseline device (starting
point for optimization) similar to the fabricated one, and the fully
optimized device designed for 300 K operation using the general
guidelines given in Section VII. The baseline device has a 350-nm
lateral width, and two symmetric 200-nm-long gate electrodes placed 100
nm downstream from the HJ. Source and drain doping in the baseline
device is smaller than in the fabricated devices in order to avoid
degeneracy and carrier-carrier scattering in these regions. The thin
undoped spacer region at the heterojunction in the fabricated device
was dropped as it is likely to be washed out during materials growth.
Channel and drain lengths are shorter than in the fabricated device, as
the fabricated channel length of 500 nm far exceeds the quasi-ballistic
mean free path even at 77 K and the long drain length increases the
series resistance. Fourteen design variations, covering all significant
characteristics, have been defined in Table II. The optimization occurs
in two steps. First, the operation of the fabricated device is
analyzed. Next, a new, more suitable baseline device is defined for
optimization. Finally, single-parameter variations are performed
successively until the optimum is reached. A fully statistical response
study is not necessary because of the microscopic insight provided by
the Monte Carlo method.

IV.  Correlation  with  Measured  Data 

The accuracy of the method was verified by simulating a two-dimensional
cross section of the fabricated device and then comparing simulated
steady-state current-voltage ($I$-$V$) characteristics with those
measured in the fabricated device [4], whose layer sequence is given in
Table I. It has 10 parallel fingers which are each 132 µm long and 350
nm wide, with 200-nm-long gates. The gate-to-source spacing is 100 nm.
The simulated steady-state drain current differed from measured data at
300 K by less than 15% (maximum global error), a result obtained
without any adjustable parameters.

V.  Steady-State  Operation 

The key to understanding steady-state operation of this class of
devices is the dipole layer at the heterojunction. It was recently
shown that two-dimensional macroscopic current continuity, in
conjunction with the lateral space charge induced by the gate
electrodes, controls the electron injection conditions over the
heterojunction [9]. Specifically, a dipole moment forms at the
heterojunction. Its magnitude and direction depend on the externally
applied gate voltage. Usually, the dipole moment is directed so as to
retard ballistic injection, which then becomes gate-voltage dependent.
The baseline VFET design (device 1) demonstrates this effect very
distinctly. Its electron density, average electron drift velocity, and
self-consistent conduction band edge in the center of the channel along
the direction of carrier flow are given in Figs. 2-4. This effect
limits the performance of the baseline device to $g_m = 312$ mS/mm and
$f_T = 64$ GHz at a drain current density of $I_D = 5 \times 10^4$
A/cm$^2$. This constitutes a negligible improvement over the GaAs
device with no embedded heterojunction (device 2).

Channel-limited transport in an FET is forced by the applied
gate-to-source voltage $V_{GS}$ via the depletion regions at a location
in the channel where carrier densities are low and where carrier
velocities reach approximately the saturation velocity. Therefore,
three different methods for controlling channel-limited transport were
investigated by adjusting device parameters from their baseline values:
a) reduction of channel length to 220 nm (device 3), b) gate electrode
placement symmetrically around the HJ ($L_{GS} = -100$ nm, device 4),
and c) enhanced channel doping ($N_{Ch} = 7 \times 10^{16}$ cm$^{-3}$,
device 5). In each of these cases the reversal of the sign of the
dipole layer moment at the HJ is prevented, which is reflected in an
enhanced $g_m$ (Fig. 5). Because the gate capacitance $C_G$ is also
affected, the cutoff frequency $f_T$ may or may not improve (Fig. 5).
The device with the gate overlapping the source (device 4) exhibits a
substantially reduced $f_T$ due to increased $C_G$, and performs worse
than the device with no launcher (device 2). Devices 3 and 5 both
showed improved performance, with $g_m$'s of 333 and 418 mS/mm,
respectively, and $f_T$'s improved by 30% and 17% over the baseline
device. Although both exhibit the desired flatband condition at the HJ
even in saturation, device 3 with the short channel still suffers from
channel-limited transport due to insufficient channel doping, and
device 5 with the enhanced channel doping still has a channel longer
than the quasi-ballistic mean free path. Thus the best method for
preventing space-charge reversal at the heterojunction is to both
decrease the channel length and increase the channel doping (device 6,
$g_m = 438$ mS/mm, $f_T = 81$ GHz). This device is taken as the new
baseline device.

TABLE II
Definition of Device Parameter Changes for Optimization

Fig. 2. Steady-state electron concentration along the center of the
channel in the direction of current flow for devices 1, 4, and 5 at 300
K. See Table II for device parameters. $V_{GS} = +0.2$ V and
$V_{DS} = +1.0$ V.

Fig. 3. Steady-state average electron drift velocity along the center
of the channel in the direction of current flow for devices 1, 4, and 5
at 300 K. $V_{GS} = +0.2$ V and $V_{DS} = +1.0$ V.

Fig. 4. Steady-state self-consistent $\Gamma$ valley band edge along
the center of the channel in the direction of current flow for devices
1, 4, and 5 at 300 K. See Table II for device parameters.
$V_{GS} = +0.2$ V and $V_{DS} = +1.0$ V.

The structure of the HJ launcher itself is the other important factor
controlling steady-state operation. One expects that a large conduction
band offset at the HJ launcher results in enhanced immediate electron
transfer into the heavier mass L valleys. This mechanism will make
quasi-ballistic channel transport impossible because transfer from the
$\Gamma$ to L valley occurs between orthogonal quantum states,
randomizing all components of the momentum wave vector. In addition,
quantum-mechanical reflection at the HJ is increased. Simulation
results for device 7 with a doubled band offset ($H_{HJ} = 266$ meV)
confirm this expectation with an $f_T$ of less than half that for the
device with no launcher (device 2).

Fig. 5. Transconductance $g_m$, gate capacitance $C_G$, and cutoff
frequency $f_T$ for devices 1-15 at 300 K. See Table II for device
parameters. $V_{GS}$ was stepped from +0.2 to +0.5 V while $V_{DS}$ was
held at +1.0 V.

Although ballistic injection occurs in device 8 with an enhanced
channel doping ($N_{Ch} = 2 \times 10^{17}$ cm$^{-3}$), ballistic
transport in the channel is prevented by the dominant ionized impurity
scattering mechanism, which has no strong preference for small-angle
scattering for electron energies in the channel; a widened and drifted
carrier distribution function downstream from the HJ is produced.
Current continuity together with the large channel doping actually
forces launcher-limited transport in this device, as demonstrated by a
pulled-down band edge. This results in no improvement over the new
baseline, device 6. If the constant mole fraction section in the HJ
launcher of the new baseline device 6 is left out with everything else
being constant, the resulting device 10 suffers from the largest gate
capacitance. In this case the conditions for thermionic emission are no
longer satisfied, and $\Gamma$ to L valley transfer in the channel is
increased. This results in no improvement over the new baseline, device
6. Decreasing the drain length by nearly half to 150 nm (device 12)
also gives no improvement over the new baseline, device 6.

Substantial improvement is realized by decreasing the launcher height
($H_{HJ} = 66$ meV, device 9), increasing launcher doping
($N_L = 1 \times 10^{18}$ cm$^{-3}$, device 11), and decreasing the
gate length ($L_G = 75$ nm, device 13). Device 9 with the shallower
launcher gave a notable improvement over the baseline device 6,
primarily because less $\Gamma$ to L valley transfer downstream from
the heterojunction enhances the transconductance, while still providing
sufficient ballistic injection at the HJ. Increasing launcher doping
prevents launcher-limited transport as evident from device 6.
Decreasing the gate length primarily reduces the gate capacitance,
while still providing a channel pinch-off capability. Moving this short
gate farther downstream ($L_G = 100$ nm, $L_{GS} = +50$ nm, device 14)
shows nearly no difference, indicating that $L_{GS} \approx +25$ nm is
sufficient to prevent gate-source interaction.

VI. Transient Operation

The large-signal switching characteristics are quantified here via the
response of the device to a voltage pulse applied to the gate terminals
while keeping the drain voltage fixed during the transient. A gate step
voltage of $\Delta V_{GS} = +0.3$ V (less depletion) with zero rise
time for a period of 20.48 ps was used with the drain biased into
saturation ($V_{DS} = +1.0$ V). This corresponds to an increase in the
drain current of 74% for the baseline device. At this bias point $g_m$
has half its maximum value. From the steady-state operation principles
discussed above, one expects that devices with a voltage-dependent, and
hence current-dependent, dipole layer at the HJ (with polarity
reversal) will have poor switching characteristics with a long period
of damped charge and terminal current oscillations. This is confirmed
by our simulations. The oscillations are driven by the following two
mechanisms. First, the large-signal transient settling time is largely
determined by the current density, which is substantially reduced
during the transient due to the current-dependent ballistic injection.
Second, the nonlinear injection process at the HJ and the linear
injection processes at the ohmic source and drain contacts are coupled.
This coupling occurs on a time scale on the order of the dielectric
relaxation time $\tau_{DR}$, which is about 15 fs in the heavily doped
source and drain regions. In contrast, the overall current density
through the device is at best modulated on a time scale related to the
plasma frequency $\omega_p$ (about 100 fs), and at worst on the time
scale of the source-heterojunction transit time $\tau_{S-HJ}$, which is
about 1 ps for this device size. Combined with the current-dependent
injection, the presence of these two different natural time scales
leads to an out-of-phase heterojunction-to-ohmic contact feedback,
which drives the current oscillations during the transient (Fig. 6).
Damping is provided by the scattering mechanisms. This is confirmed by
the fact that the period of oscillations in the drain current transient
in Fig. 6, about 320 fs, exactly matches the time dependence of the
ballistic fraction at the HJ (Fig. 7). The presence of the two coupled
processes is manifested in the strong frequency dependence of the
transconductance $g_m$ (real part of $y_{21}$ in Fig. 8). Also seen in
that figure is the excessive gate capacitance of some of the devices
arising from the gate-source interaction. For example, device 4 has a
positive $y_{21}$ susceptance. All devices showed a similar gate
self-admittance $y_{11}$: the susceptance was capacitive and the
conductance small because the Schottky gates allowed only displacement
current to flow.
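
The two natural time scales named above follow from textbook expressions, $\tau_{DR} = \epsilon/(q n \mu)$ for dielectric relaxation and $\omega_p = \sqrt{n q^2/(\epsilon m^*)}$ for the plasma frequency. The sketch below only shows how they scale with carrier density. The relative permittivity (12.9) and effective mass (0.067 $m_0$) are commonly quoted GaAs values, and the carrier densities and mobility used in any call are hypothetical placeholders, since the corresponding entries of Tables I and II are not reproduced here; the output is not expected to reproduce the figures quoted in the text.

```python
import math

Q = 1.602176634e-19        # elementary charge, C
EPS0 = 8.8541878128e-12    # vacuum permittivity, F/m
M0 = 9.1093837015e-31      # free-electron mass, kg


def dielectric_relaxation_time(n, mobility, eps_r=12.9):
    """tau_DR = eps / (q * n * mu).

    n in m^-3, mobility in m^2/(V s); returns seconds.
    eps_r = 12.9 is an assumed GaAs value.
    """
    return eps_r * EPS0 / (Q * n * mobility)


def plasma_angular_frequency(n, m_eff=0.067, eps_r=12.9):
    """omega_p = sqrt(n * q^2 / (eps * m*)), in rad/s.

    n in m^-3; m_eff is the effective mass in units of the
    free-electron mass (0.067 is an assumed Gamma-valley value).
    """
    return math.sqrt(n * Q * Q / (eps_r * EPS0 * m_eff * M0))


def plasma_period(n, m_eff=0.067, eps_r=12.9):
    """Oscillation period 2*pi/omega_p, in seconds."""
    return 2.0 * math.pi / plasma_angular_frequency(n, m_eff, eps_r)
```

Because $\tau_{DR}$ falls as $1/n$ while the plasma period falls only as $1/\sqrt{n}$, heavily doped contact regions respond faster than the more lightly doped channel, which is the separation of time scales the argument above relies on.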

VII.  Device  Design  Criteria 

Fig. 6. Transient drain current $I_D$ as a function of time for
baseline and optimized devices 1 and 15. The applied gate step voltage
was 0.3 V into less channel depletion: $V_{GS}$ decreased from +0.5 to
+0.2 V, and $V_{DS} = +1.0$ V.

Fig. 7. Distribution function for $\Gamma$-valley electrons 10 nm
downstream from the HJ launcher in the center of the channel for the
baseline device 1 at 300 K, given at three specified times after the
application of the gate step voltage. $V_{GS}$ increased from +0.2 to
+0.5 V, and $V_{DS} = 1.0$ V.

Fig. 8. Small-signal $y$ parameters $y_{11}$ and $y_{21}$ as a function
of frequency for devices 1, 4, and 15. Device parameters are defined in
Table II. Simulated results are given in 24.4-GHz increments between 0
and 147 GHz. $V_{GS}$ was increased from +0.2 to +0.5 V for 20.48 ps
while $V_{DS} = +1.0$ V.

Design criteria have been derived from the steady-state and transient
operation principles discussed above. The key to balanced high-speed
and high-current operation is held by the dipole layer at the HJ: it
should not be a retarding one for the desired operating conditions and
its bias dependence should be as small as possible. This can be
accomplished by following the guidelines below:

1) The channel length $L_{Ch}$ should be comparable to the
quasi-ballistic mean free path in the channel (~200 nm at 300 K lattice
temperature for GaAs). Then the formation of a voltage-dependent
retarding dipole layer at the heterojunction is prevented.

2) The gate length $L_G$ should be reduced until the edge and area gate
capacitance contributions become comparable. The gate edge should not
be close to the heavily doped source or drain areas, to minimize
capacitive feedback. Gate lengths of 130 nm with +35-nm gate-to-source
spacings for channel lengths of 200 nm can be achieved [10].

[Figure caption fragment] ... +1.1 V for the optimized device. A gate
Schottky barrier height of 0... is assumed.

3) The channel doping density $N_{Ch}$ is set by the tradeoff between
competing demands: its reduction is required to minimize ionized
impurity scattering, and its enhancement is required to prevent
channel-limited transport.

4) The launcher height $H_{HJ}$ should be set by trading off the
kinetic energy increase at the HJ against electron transfer into upper
conduction band valleys. It has previously been shown by Tang and Hess
[11] that a 70-meV launcher height, which corresponds to $x = 0.11$,
will result in average overshoot velocities as large as
$5 \times 10^7$ cm/s for 300 K.

5) The length of the launcher $L_L$ must be long enough to allow for a
symmetric quasi-equilibrium momentum distribution immediately upstream
from the HJ, to satisfy conditions for thermionic emission.

6) The launcher doping $N_L$ must be large enough to support overshoot
velocities in the channel in order to prevent launcher-limited
transport. Doping it the same as the rest of the source is acceptable
for GaAs.

7) The length of the heavily doped drain region should be minimized to
keep the transit time short. About 150 nm is needed in order to
thermalize hot carriers from the L valley into the $\Gamma$ valley
before reaching the ohmic drain contact.


8) The length of the graded launcher region should not be reduced to
below 50 nm to avoid quantum-mechanical reflection. The same
requirement justifies the use of the semi-classical Monte Carlo
transport formulation [5].

VIII. Optimized Device Characteristics

The optimized device 15 simultaneously displays excellent steady-state
and transient characteristics for 300 K operation. It reaches the
computed intrinsic $f_T$ of 144 GHz at a drain current density of 150
kA/cm$^2$, which is a 120% improvement compared to the baseline design.
The optimized device exhibits a peak $f_T$ of 225 GHz at a four times
larger current density of $4 \times 10^5$ A/cm$^2$ compared to a
maximum $f_T$ of 100 GHz at $1 \times 10^5$ A/cm$^2$ for the baseline
device (Fig. 9). The large $g_m$ is nearly independent of frequency and
the transsusceptance remains near zero (Fig. 8). The gate step
transient is very short and critically damped, so that no oscillatory
drain current behavior is seen (Fig. 6). The gate self-conductance
(Re $y_{11}$) is the smallest of all devices in Table II, due to its
short gate length and the larger dielectric relaxation and plasma
frequencies resulting from the higher channel carrier concentration.
The characteristics of the optimized device 15 have also been examined
for 77 K operation. Simulations show that increased channel resistance
due to dopant freeze-out is compensated by the enhanced overshoot
velocities and bandgap narrowing. Effective mass changes are minor
effects and mutually compensating. The HJ-VFET optimized at 300 K but
operated at 77 K shows excellent temperature-independent operation, but
suffers from launcher-limited transport ($f_T = 167$ GHz at
$I_D = 95$ kA/cm$^2$). For best performance at 77 K the device would
require reoptimization following the guidelines in Section VII.

IX.  Conclusions 

The insight provided by the analysis facilitated the establishment of
device design principles for highest speed operation. Ballistic
electron injection and the multidimensional dipole layer are the key
issues in heterojunction VFET's compared to conventional FET's.
Optimized AlGaAs/GaAs VFET's were shown to reach cutoff frequencies up
to 100 GHz for "normal" current densities below $1 \times 10^5$
A/cm$^2$, while peak cutoff frequencies up to 225 GHz are possible for
current densities as large as $4 \times 10^5$ A/cm$^2$. Fabricated
devices never reached such performance levels because their channel
doping was deliberately set low in order to reduce scattering and
effect quasi-ballistic transport [12]. As shown here, transport became
channel-limited and a retarding dipole layer was formed at the
heterojunction. Maximum measured transconductances of 60 mS/mm did not
stimulate any high-frequency characterization [4]. Our work shows that
proper control of ballistic injection under multidimensional
hot-electron conditions requires careful device optimization, which
would be difficult to achieve without the microscopic insight provided
by accurate device simulation. Finally, this study shows that HJ-VFET's
should be reconsidered for applications for which both high speeds and
the largest current densities are required.

the carriers in the drift-diffusion device analysis program. This study
suggests that, for shorter gate length MESFET's, it is possible to
obtain a reasonably accurate simulation with a modified drift-diffusion
simulator such as PISCES. This results in a much faster computation
than with a Monte Carlo simulator, and makes it possible to use the
device parameter extraction capabilities of PISCES.

Space-Charge Effects in Ballistic Injection Across
Heterojunctions

S.  R.  Weinzierl  and  J.  P.  Krusius 


Abstract - Conditions under which ballistic injection across
heterojunctions is suppressed in unipolar FET devices have been
examined using two-dimensional Monte Carlo simulation. Gate-induced
lateral space charges influence, via macroscopic current continuity,
the dipole layer at the heterojunction. A retarding dipole layer is
shown to result in ballistic electron fractions and transit times
comparable to those found in homojunction devices. Guidelines for
avoiding the formation of a retarding dipole layer are given.

Semiconductor field-effect devices utilizing hot-electron anodes, in
particular ballistic injection across a heterojunction, have long held
promise as high-speed switches [1]. Overshoot velocities are thought to
arise downstream from the heterojunction as injected electrons convert
potential into kinetic energy. By maintaining overshoot velocities for
hundreds of nanometers into the channel, these near-ballistic electrons
should provide a substantially reduced transit time. However, this
description of electron injection is overly naive considering recent
results from one-dimensional self-consistent Monte Carlo simulations
[2]. Temperature, applied voltage, and launcher height can change the
magnitude and direction of the dipole field at the heterojunction and
thus profoundly influence the injection process in laterally uniform
one-dimensional structures. The largest injected ballistic fraction is
achieved when near-flatband conditions exist at the heterojunction. We
consider here for the first time how two-dimensional phenomena, always
present in real devices, affect the injection process. Two-dimensional
self-consistent ensemble Monte Carlo simulation is used to show that
lateral space charges induced by gate electrodes downstream from the
heterojunction can also cause dipole moment reversal at the
heterojunction and thus suppress ballistic injection.

The self-consistent fully two-dimensional ensemble Monte Carlo method
used here has been described elsewhere [3]. While the heterojunction
space-charge effect is a generic one, a specific device has to be
selected in order to study it quantitatively. A vertical FET (VFET)
with a cross section of 800 nm x 350 nm (Fig. 1), identical to a device
recently examined in a full optimization study, has been chosen here
[4]. The Al$_x$Ga$_{1-x}$As heterojunction launcher consists of a 75-nm
region, in which the Al mole fraction $x$ increases linearly from 0 to
22%, followed by a 100-nm region with a constant mole fraction of 22%.
Current in the 300-nm-long channel is controlled by two ideal
200-nm-long Schottky-barrier gate electrodes placed symmetrically on
both sides 100 nm downstream from the heterojunction.

The simulated electron concentration for negative applied gate voltages
(more depletion) shows that a retarding dipole layer is formed at the
heterojunction via sign reversal at higher drain voltages (Fig. 2).
Note that this result has been obtained by solving the nonequilibrium
transport equations and Poisson's equation without simplifying
approximations. The observed phenomenon can be explained by extending
the one-dimensional theory for flatband conditions [2] into two
dimensions. For laterally uniform one-dimensional injection, the flat
band at the heterojunction will prevail for all applied voltages for
which the following macroscopic current continuity relation is
satisfied:

$$n_{AlGaAs}\, v_{inj} = n_{Ch}\, v_{Ch}. \qquad (1)$$

Here $n_{AlGaAs}$ and $n_{Ch}$ denote the actual carrier densities in
the launcher and the channel, and $v_{inj}$ and $v_{Ch}$ the average
drift velocities for injected electrons at the heterojunction and
downstream from it, respectively. Note that the two carrier
concentrations are not solely determined by the local doping densities,
but are also influenced by carrier spillover and transport effects.
Local doping densities provide, however, a good starting point for
estimating $n_{AlGaAs}$ and $n_{Ch}$. Equation (1) is a direct
consequence of the local enforcement of current continuity across the
heterojunction. As previously shown, the average electron velocities in
(1) will necessarily include the effects of ballistic electrons and
quantum-mechanical reflection processes at the heterojunction, because
they are determined by an integral over the entire local electron
distribution function.

Fig. 1. Cross section of heterojunction VFET. The Al$_x$Ga$_{1-x}$As
grading profile (mole fraction $x$ as function of position) and applied
voltages are also shown.

Fig. 2. Electron concentration, from Monte Carlo method, as a function
of position along the center of the channel for the gate voltages
$V_{GS}$ = -0.8 V, -0.1 V, and +0.6 V. $V_{DS} = +1.0$ V. A
Schottky-barrier height of 0.8 V was assumed. Note that carrier
concentrations in heavily doped regions are lower than the doping
densities because of fully included donor statistics.

Fig. 3. Ballistic fraction (percent) in center of channel 10 nm
downstream from heterojunction, average channel transit time
$\tau_{Ch}$, and injection efficiency $\alpha$ as a function of gate
voltage $V_{GS}$ with $V_{DS} = +1.0$ V and $T = 300$ K. All results
are from Monte Carlo method without approximations.

In two dimensions, current continuity is no longer fulfilled locally,
but over each cross section of the device. The throughput in the
channel downstream from the heterojunction will also be limited by the
depletion regions modulated by the lateral gate electrodes.
One-dimensional depletion theory can be used to obtain an expression
for the widths of the lateral depletion regions $\Delta W$, which
combined with the channel width $W$ determine the current throughput.
Therefore, the current continuity equation for two dimensions reads

$$\frac{n_{AlGaAs}}{n_{Ch}} \le \frac{v_{Ch}}{v_{inj}} \left(1 - \frac{2\,\Delta W}{W}\right) = \frac{v_{Ch}}{v_{inj}}\, F_{Ch}. \qquad (2)$$

Here $W$ is the full lateral width of the device, $\Delta W$ the
depletion width, $\epsilon$ the dielectric constant, $V_{bi}$ the
built-in potential of the metal-semiconductor junction, $V_{GS}$ the
applied gate-source voltage, $k_B$ Boltzmann's constant, and $T$ the
lattice temperature. The channel width factor $F_{Ch}$ in (2) is always
less than unity.
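
The channel width factor can be illustrated numerically with a minimal sketch. It assumes the textbook one-dimensional abrupt-junction depletion width, $\Delta W = \sqrt{2\epsilon (V_{bi} - V_{GS} - k_B T/q)/(q N_{Ch})}$, and takes $F_{Ch} = 1 - 2\Delta W/W$, clipped at zero for a pinched-off channel. The function names, the channel doping used in any call, and the use of the Schottky-barrier height as a stand-in for $V_{bi}$ are assumptions made for illustration only.

```python
import math

Q = 1.602176634e-19        # elementary charge, C
EPS0 = 8.8541878128e-12    # vacuum permittivity, F/m
K_B = 1.380649e-23         # Boltzmann constant, J/K


def depletion_width(v_gs, v_bi, n_d, temperature=300.0, eps_r=12.9):
    """One-dimensional depletion width under a Schottky gate, in metres.

    v_gs, v_bi in volts; n_d (channel doping) in m^-3.
    eps_r = 12.9 is an assumed GaAs value.
    """
    thermal_voltage = K_B * temperature / Q
    drop = v_bi - v_gs - thermal_voltage
    if drop <= 0.0:
        return 0.0  # no depletion region in this simple model
    return math.sqrt(2.0 * eps_r * EPS0 * drop / (Q * n_d))


def channel_width_factor(v_gs, width, v_bi, n_d, temperature=300.0):
    """F_Ch = 1 - 2*dW/W for two symmetric lateral gates, clipped at 0."""
    dw = depletion_width(v_gs, v_bi, n_d, temperature)
    return max(0.0, 1.0 - 2.0 * dw / width)


if __name__ == "__main__":
    # Hypothetical inputs: W = 350 nm, V_bi ~ 0.8 V, N_Ch = 5e16 cm^-3
    for v_gs in (-0.6, -0.3, 0.0, 0.3, 0.6):
        f_ch = channel_width_factor(v_gs, 350e-9, 0.8, 5e22)
        print(f"V_GS = {v_gs:+.1f} V  F_Ch = {f_ch:.2f}")
```

With these placeholder inputs the factor falls monotonically as the gate voltage is made more negative and reaches zero once the two depletion regions meet, which is the qualitative behavior the inequality in (2) depends on.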

The two average velocities in (2) always satisfy
$v_{Ch} \le v_{inj}$, since the upper limit corresponds to
quasi-ballistic injection and transport for all electrons. In addition,
$n_{Ch}$ should be smaller than $n_{AlGaAs}$ because of heavy doping in
the source region. Consequently, the inequality in (2) does not usually
hold, and therefore a retarding dipole layer will be formed. Fig. 2
illustrates these conditions. The effect of the dipole layer reversal
on ballistic injection is clearly demonstrated by the Monte Carlo
results in Fig. 3, which show the fraction of ballistic electrons 10 nm
downstream from the heterojunction as a function of $V_{GS}$. The
ballistic fraction has been calculated by integrating numerically over
the ballistic peak in the distribution function from
$k_z = 4.0 \times 10^6$ cm$^{-1}$ to infinity. The ballistic fraction
is reduced from 25% to zero as the gate bias is dropped from
$V_{GS} = +0.6$ V to $V_{GS} = -0.6$ V (more depletion). This rapid
drop-off of the ballistic fraction as a function of the gate voltage
exactly replicates the behavior of the channel width factor $F_{Ch}$
for the present case with $W = 350$ nm. Thus we have shown that lateral
space charges control electron injection across the heterojunction via
the dipole layer reversal mechanism.
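
The ballistic-fraction estimate described above amounts to the area under the tail of the sampled distribution function divided by its total area. A minimal sketch with trapezoidal integration follows. The function names and the sampled-grid representation are hypothetical, the default lower limit of $4.0 \times 10^6$ cm$^{-1}$ is an assumed reading of the threshold quoted in the text, and the upper limit "infinity" is approximated by the end of the sampled grid.

```python
def trapezoid(xs, ys):
    """Trapezoidal-rule integral of samples ys over grid xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def ballistic_fraction(k, f, k_min=4.0e6):
    """Fraction of the distribution f(k) lying at k >= k_min.

    k: ascending wave-vector grid (cm^-1); f: sampled distribution
    function on that grid. Only grid points at or above k_min enter
    the tail integral, so k_min should coincide with a grid point.
    """
    total = trapezoid(k, f)
    tail = [(ki, fi) for ki, fi in zip(k, f) if ki >= k_min]
    if total <= 0.0 or len(tail) < 2:
        return 0.0
    tail_k = [ki for ki, _ in tail]
    tail_f = [fi for _, fi in tail]
    return trapezoid(tail_k, tail_f) / total
```

In an ensemble Monte Carlo code the same quantity could equally be obtained by counting simulated particles above the threshold; the integral form is shown here because it mirrors the wording of the text.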

The average transit time τ_ch of electrons across the channel,
computed directly as an estimator from the self-consistent Monte
Carlo results (Fig. 3), is significantly reduced as the ballistic
fraction increases. This transit time varies by a factor of 10 for a
1.5 V change in V_GS. This strong reduction can be explained by the
presence of a larger number of quasi-ballistic electrons and an
increased heterojunction injection efficiency α, defined as


\[ \alpha = 1 - \frac{N_{back}}{N_{forw}} \]

Here N_back and N_forw denote the number of electrons injected
upstream and downstream from the heterojunction, respectively. The
efficiency α, computed directly from the Monte Carlo results without
approximations, is also given in Fig. 3. One observes a tradeoff between
the maximum ballistic injection efficiency and the acceptable gate
voltage swing between the open and pinched-off states of the channel.

From  the  above  it  is  clear  how  to  avoid  the  reversal  of  the  dipole 
layer  at  the  heterojunction  with  all  its  adverse  consequences  for 
steady-state  and  transient  device  operation.  Two  primary  means  are 
suggested  here  to  help  satisfy  (2):  a)  Place  the  gates  higher  up¬ 
stream  in  the  channel,  but  not  too  close  in  order  not  to  increase  the 
gate-source  capacitance  excessively.  It  may  also  be  helpful  to 
shorten the gate length. A gate placement closer to the source would
shift the point of minimum lateral width (bottleneck) into an area
where either n or v is higher. b) Increase the channel doping to
boost  the  electron  concentration  in  the  channel,  but  not  too  high  in 
order  not  to  increase  the  ionized  impurity  scattering.  Other  means 
include  a  nonuniform  channel  cross  section  or  nonuniform  channel 
doping, but these are rather difficult to achieve in practice.
Equation (2) also clearly explains why fabricated VFET devices have
never reached expected performance levels as measured by
transconductance and cutoff frequency [5], [6]. The full multiparameter
optimization of heterojunction VFET devices for high-speed and
high-current operation, discussed in detail elsewhere, supports the
above conclusions [4].


ABSTRACT

The  role  of  free  carrier  screening,  in  the  ultrafast  relaxation  of  optically  excited  carriers,  is  reassessed 
using  the  ensemble  Monte  Carlo  technique.  The  conventional  static  screening  approximation  is 
compared  to  a  new  dynamic  screening  model.  Evolution  of  the  nonequilibrium  dynamic  dielectric 
function  and  its  consequences  for  the  carrier  scattering  are  examined.  It  is  shown  that  dynamic  screening 
results  in  significant  enhancement  of  both  the  carrier-carrier  and  polar  optic  phonon  scattering  rates. 
Relaxation  times  for  the  dynamic  screening  model  are  found  to  be  dramatically  shorter  than  those  for  the 
static  screening  model.  Methods  of  experimentally  differentiating  between  the  two  models  are  proposed. 


1. INTRODUCTION

In the last few years the femtosecond relaxation of optically excited electron-hole plasmas has received
considerable interest. A number of investigations, both experimental and simulation,^4-7 have drawn
attention to carrier-carrier scattering as an important mechanism through which the relaxation occurs.
Until recently, carrier-carrier scattering has been exclusively modeled using a static screening approach.
Evidence has been accumulating that this approach may be inadequate. Calculations show that static
screening seriously underestimates the carrier-carrier scattering rates.^8,9 Further, recent experiments
have reported carrier-carrier scattering rates significantly larger than are possible within the static
screening approximation. Recently, a molecular dynamics approach combining free carrier screening
and carrier-carrier scattering has succeeded in improving correlation with experiment.^2,5,7 In this work
the effect of free carrier screening is examined using an ensemble Monte Carlo simulation. A new model
of  free  carrier  screening  has  been  developed  that  fully  includes  both  the  frequency  and  wavelength 
dependence  of  the  free  carrier  dielectric  function.  In  contrast  to  the  molecular  dynamics  approach,  this 
new  model  operates  within  the  traditional  ensemble  Monte  Carlo  method  and  can  be  generalized  to  other 
situations.  In  order  to  investigate  the  role  of  free  carrier  screening  in  the  relaxation  of  these  optically 
excited  electron-hole  plasmas,  simulations  of  femtosecond  optical  pulse-probe  experiments  were 
performed  incorporating  both  this  new  model  and  a  standard  long  wavelength  static  approximation  on 
In0.53Ga0.47As thin films. The number of physical processes to be considered has been minimized, and
the  role  of  free  carrier  screening  emphasized,  by  limiting  the  energy  of  the  exciting  photons  to  within 
100  meV  of  the  band  gap.  This  allows  the  conduction  band  upper  valleys  and  split  off  hole  band  to  be 
neglected.  It  also  maximizes  the  amount  of  free  carrier  screening  for  a  given  carrier  density. 


2.  FREE  CARRIER  SCREENING 

The  models  of  free  carrier  screening  used  in  this  investigation  are  based  on  the  Lindhard  dielectric 
function.  This  formula  has  the  advantages  that  it  fully  accounts  for  both  the  energy  and  wavelength 
dependence  of  the  linear  dielectric  functions,  and  can  be  calculated  for  an  arbitrary  distribution  of  free 
carriers  so  that  no  assumptions  about  the  form  of  the  nonequilibrium  distribution  function  need  to  be 
made. The dielectric function for a system of free carriers is given by the Lindhard formula as:
where ε_0 is the dielectric constant of the semiconductor in the ground state. If the static long wavelength
limit is taken, this simplifies to:


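\[ \epsilon(\mathbf{q}) = \epsilon_0 - \frac{4\pi e^2}{q^2} \sum_{\mathbf{k}} \frac{\mathbf{q}\cdot\nabla_{\mathbf{k}} f(\mathbf{k})}{\mathbf{q}\cdot\nabla_{\mathbf{k}} E(\mathbf{k})} \tag{2} \]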


This  is  the  starting  point  for  most  current  models  of  free  carrier  screening.  Clearly  this  approximate 
expression  can  not  express  the  full  complexity  of  the  more  accurate  expression  Eqn.  (1).  For  purposes  of 
this  investigation  we  define  the  free  carrier  dielectric  function  as: 


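\[ \epsilon_{fc}(\mathbf{q},\omega) = \frac{\epsilon(\mathbf{q},\omega)}{\epsilon_0} \tag{3} \]

The relationship between the free carrier dielectric function and the carrier scattering rates is:

\[ \lambda(\mathbf{k}_1,\mathbf{k}_2) = \frac{\lambda_0(\mathbf{k}_1,\mathbf{k}_2)}{|\epsilon_{fc}|^2} \tag{4} \]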


where λ_0(k_1,k_2) is the scattering rate neglecting screening by free carriers and λ(k_1,k_2) is the scattering
rate including free carrier screening. Thus, it is really the inverse of the free carrier dielectric function
that is of interest in this work.
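
For intuition about how the static long wavelength limit of Eqn. (2) feeds into the screened rate of Eqn. (4), the Python sketch below evaluates the special case of nondegenerate (Maxwell-Boltzmann) carriers in a parabolic band, where the free carrier dielectric function reduces to 1 + q_s^2/q^2 with a Debye-type screening wave vector q_s. This equilibrium special case is an assumption made here for illustration; the simulations in this work use the actual nonequilibrium distributions. The function names, the relative permittivity of 13.9, and the carrier density are likewise illustrative choices.

```python
import math

E_CHARGE = 1.602176634e-19      # elementary charge, C
EPS_VACUUM = 8.8541878128e-12   # vacuum permittivity, F/m
K_BOLTZMANN = 1.380649e-23      # Boltzmann constant, J/K

def debye_wavevector(density_m3, temperature_k, eps_r):
    """Static long-wavelength screening wave vector (1/m), nondegenerate carriers."""
    return math.sqrt(E_CHARGE**2 * density_m3
                     / (EPS_VACUUM * eps_r * K_BOLTZMANN * temperature_k))

def screened_rate_factor(q, q_s):
    """1/|eps_fc|^2 with eps_fc = 1 + (q_s/q)^2; multiplies the unscreened rate."""
    eps_fc = 1.0 + (q_s / q) ** 2
    return 1.0 / eps_fc**2

# Illustrative values: 1e17 cm^-3 carriers at 300 K, assumed eps_r = 13.9.
q_s = debye_wavevector(1e23, 300.0, 13.9)
factor = screened_rate_factor(q_s, q_s)   # momentum transfer equal to q_s
```

For a momentum transfer equal to q_s the rate is reduced to one quarter of its unscreened value, which shows why small momentum transfers are the most strongly screened in a static model.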

The  screening  models  used  in  this  work  are  derived  using  Eqn.  (1)  and  Eqn.  (2),  with  the  simplifications 
that  anisotropy  in  the  carrier  distribution  functions  and  band  structure  are  ignored,  and  an  approximate 
parabolic  band  structure  is  used  in  calculating  the  dielectric  function.  The  resulting  dielectric  function  is 
isotropic  in  momentum  (k)  space.  In  the  Monte  Carlo  simulation  the  dielectric  function  is  recalculated 
self-consistently  from  the  carrier  distribution  functions  after  each  5  fs  time  step.  The  contributions  of  all 
three  carrier  types  (electrons,  heavy  holes,  and  light  holes)  are  included.  The  static  long  wavelength 
model used here is similar to that proposed by Osman and Ferry.^6


3.  FORMULATION  AND  IMPLEMENTATION 

The  relaxation  dynamics  of  the  optically  excited  carriers  is  studied  using  the  ensemble  Monte  Carlo 
approach,  including  electrons  and  holes,  to  simulate  the  evolving  distribution  function,  and  its 
interaction  with  the  optical  field.  The  distribution  function  includes  all  three  momentum  (k)  space 
dimensions  and  one  spatial  dimension  normal  to  the  surface  of  the  thin  film  and  parallel  to  the  photon 
beam.  The  distribution  function  is  assumed  homogeneous  in  the  two  lateral  directions  in  the  plane  of  the 
film.  Inhomogeneities  arising  from  the  optical  excitations  are  fully  included  with  carrier  motion 
governed  by  a  self-consistent  electric  field  (solution  to  Poisson's  equation). 

The model of the band structure includes the conduction band and the heavy and light hole
valence bands around the fundamental optical gap in the center of the zone. The bands are described by
a four band k·p method, with perturbative corrections from higher bands, as given by Kane.^10 The
perturbative terms are necessary to get the correct sign for the heavy hole mass and include band
warping. The resulting heavy hole band is parabolic but warped, while the other two bands are both
nonparabolic  and  warped.  The  split  off  band  and  upper  conduction  band  valleys  are  neglected  because  of 
the  small  photon  energies  used  in  this  investigation. 

All important carrier scattering mechanisms are included: carrier-carrier, phonon, ionized impurity and
alloy scattering. Nonpolar optical and deformation potential phonon scattering were derived using the
method of deformation potential operators of Pikus and Bir^11 as applied to the four band k·p method.^12 To
further simplify these results the expression for nonpolar optic phonon scattering was evaluated at the
band edge and the effective deformation potential of Lawaetz^13 was used for the valence bands. Alloy
scattering is included through the elementary approach of Harrison and Hauser.^14 Polar optic phonon,
piezoelectric, and ionized impurity scattering are handled using the well known formulas including proper
overlap integrals derived from the k·p structure. Both inter- and intra-band scattering are included in the
valence bands for all single particle scattering processes. Carrier-carrier scattering is treated following the
method of Brunetti et al.^15 with the improvements suggested by Mosko et al.,^16 and with the
simplification that particles do not change bands. The polar optic phonon, ionized impurity,
piezoelectric, and carrier-carrier scattering rates are each self-consistently screened using the screening
models described above. Degenerate statistics are used for all scattering processes through the rejection
method.
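
The rejection method for degenerate statistics mentioned above can be sketched in a few lines of Python: a proposed scattering event is accepted with probability 1 - f, where f is the occupation of the final state. This is a generic illustration of the technique, not the code used in this work; the function name and the fixed occupation of 0.3 are hypothetical.

```python
import random

def accept_scattering(f_final, rng=random):
    """Accept a proposed scattering event with probability 1 - f_final.

    f_final is the occupation (0..1) of the final state; a fully occupied
    final state (f_final = 1) always blocks the transition.
    """
    if not 0.0 <= f_final <= 1.0:
        raise ValueError("occupation must lie in [0, 1]")
    return rng.random() >= f_final

# Usage: with a final-state occupation of 0.3, about 70% of events survive.
rng = random.Random(1)
trials = 20000
accepted = sum(accept_scattering(0.3, rng) for _ in range(trials))
acceptance_ratio = accepted / trials
```

Rejected events are treated as self-scattering, so the carrier continues its free flight unchanged.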

The optical excitation of electron-hole pairs is handled self-consistently. The number of carriers
generated at a given time is calculated using the instantaneous value of the carrier distribution functions
with the pulse altered to reflect the absorbed energy. Excitation rates are calculated from Fermi's golden
rule with the momentum matrix elements calculated from k·p theory. Both the anisotropy of the optical
matrix elements and their energy dependence are included and reflected in the excited carrier
distributions.


4.  RESULTS 

Simulated pulse-probe experiments were performed for a 0.25 μm In0.53Ga0.47As thin film using both
the dynamic and static screening models discussed above. Both the excitation and probe pulses have a
hyperbolic secant squared (sech^2) intensity profile with 100 fs FWHM and a photon energy of 810 meV. This
corresponds to a combined carrier energy of 60 meV above the optical gap of 0.75 eV. The intensity of the
excitation pulse was 5.0 x 10^13 eV/cm^2 and the probe pulse was assumed to have negligible intensity.
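
A pulse envelope of this kind can be written down explicitly. The Python sketch below assumes that the secant squared profile is the hyperbolic secant squared (sech^2) form, normalizes it so that its time integral equals a given energy per unit area, and converts the FWHM into the sech^2 width parameter. The function name and the normalization convention are assumptions for illustration, and the energy value passed in is only an example.

```python
import math

def sech2_pulse(t_fs, fwhm_fs, fluence):
    """Intensity of a sech^2 pulse at time t_fs, normalized to the given fluence.

    I(t) = fluence / (2 tau) * sech^2(t / tau), with FWHM = 2 ln(1 + sqrt(2)) tau.
    """
    tau = fwhm_fs / (2.0 * math.log(1.0 + math.sqrt(2.0)))
    return fluence / (2.0 * tau) / math.cosh(t_fs / tau) ** 2

# Usage: 100 fs FWHM pulse; the fluence value is illustrative.
peak = sech2_pulse(0.0, 100.0, 5.0e13)
half = sech2_pulse(50.0, 100.0, 5.0e13)      # half of the FWHM away from the peak
# Riemann sum over -1000..1000 fs in 0.5 fs steps recovers the fluence.
total = sum(sech2_pulse(-1000.0 + 0.5 * i, 100.0, 5.0e13) * 0.5 for i in range(4001))
```

In a time-stepped simulation the number of photons delivered in one step follows from this intensity multiplied by the step length and divided by the photon energy.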

4.1 Free Carrier Dielectric Function

Figs.  1-4  show  the  reciprocal  of  the  dynamic  free  carrier  dielectric  function  squared  as  extracted  from 
the  simulation  at  0,  100,  200,  and  1000  fs  after  the  initial  excitation.  In  all  four  cases  the  expected 
spectrum  of  plasma  modes  is  evident  to  the  left  of  each  plot  at  high  energies.  The  plasma  frequency  can 
be  seen  to  increase  in  energy  from  0  to  100  fs  due  to  the  increase  in  carrier  density  as  is  expected.  The 
most interesting feature is the ridge extending diagonally across Figs. 1-3. The ridge is found to
correspond  to  the  plasma  spectrum  of  the  heavy  holes  taken  alone.  The  size  of  the  ridge  decreases  with 
increasing  delay  until  at  1000  fs  the  dielectric  function  takes  the  form  expected  of  an  equilibrium  carrier 
distribution. This feature is a consequence of the highly nonequilibrium heavy hole distribution at early
times.  It  results  from  the  heavy  holes  being  excited  initially  into  an  extremely  narrow  region  of 
momentum  (k)  space.  This  results  in  an  unusually  sharp  resonance  with  the  heavy  holes  for  potentials  in 
this  region  of  frequency  and  wavelength.  As  the  heavy  hole  distribution  relaxes  toward  equilibrium,  the 
ridge  shrinks  in  size  and  eventually  disappears  due  to  the  dispersal  of  heavy  holes  in  k  space  and  the 
resulting  broadening  of  the  heavy  hole  resonance.  The  smaller  bump  to  the  left  of  the  main  ridge 
appearing  at  delays  of  100  and  200  fs  has  a  similar  origin.  This  results  from  a  phonon  replica  of  the 
initial heavy hole distribution due to absorption of optical phonons.

Fig. 1. Simulated nonequilibrium free carrier screening (|ε_fc|^-2) 0 fs after
the initial excitation.

Fig. 2. Simulated nonequilibrium free carrier screening (|ε_fc|^-2) 100 fs
after the initial excitation.

Fig. 3. Simulated nonequilibrium free carrier screening (|ε_fc|^-2) 200 fs
after the initial excitation.

Fig. 4. Simulated nonequilibrium free carrier screening (|ε_fc|^-2) 1000 fs
after the initial excitation.

The large heavy hole resonance crosses the region of energy-momentum space corresponding to
carrier-carrier scattering, indicating that carrier-carrier scattering should be significantly enhanced at
early times. However, in general screening decreases with increasing energy. Thus, even if the resonance is
ignored, inelastic scattering processes are over screened by a static screening model, suppressing the
scattering rates. This applies to both carrier-carrier and polar optic phonon scattering. For the examples
discussed here, the optical phonon energy of 34 meV is too large for polar optic phonon scattering to
interact significantly with the heavy hole resonance, and the plasma frequency is too large for either polar
optic phonon or carrier-carrier scattering to interact with the ordinary plasma modes. Thus, inelastic
scattering rates are generally greater with dynamic screening than with static screening, and at early
times scattering processes that involve the transfer of small amounts of energy, such as carrier-carrier
scattering, should be greatly enhanced due to the resonance with the heavy holes.

4.2 Carrier Scattering Rates

In  order  to  verify  these  conclusions  the  scattering  rates  for  the  electrons  and  heavy  holes  are  shown  in 
Figs. 5 and 6 for both static and dynamic screening. In the case of electron-electron, electron-heavy hole,
and electron-light hole scattering, the anticipated enhancement of the carrier-carrier scattering rates at early times is
evident.  The  decay  of  these  scattering  rates  with  increasing  delay  is  also  closely  correlated  with  the 
decline  in  the  heavy  hole  resonance.  In  contrast,  the  heavy  hole-heavy  hole  scattering  rate  is  actually 
suppressed  at  early  times.  This  is  because  most  heavy  hole-heavy  hole  events  fall  slightly  forward  of  the 
ridge  (towards  larger  k)  in  energy-momentum  space  where  the  screening  is  slightly  greater.  At  later 
times, all the carrier-carrier scattering rates except electron-heavy hole scattering are significantly larger
than their static counterparts, in agreement with the conclusion that static screening generally over screens
inelastic scattering events. The magnitude of the difference is smallest for the heavy hole-heavy hole
case, reflecting the large momentum transfers involved due to the flatness of the heavy hole band. Thus this
process  is  only  weakly  screened  in  both  cases.  Electron-heavy  hole  scattering  is  an  exception  to  the 
general  enhancement  of  carrier-carrier  scattering  since  the  large  differences  in  carrier  mass  make  this 
process  approximately  elastic. 

For  the  polar  optic  phonon-heavy  hole  scattering  rate  the  difference  between  the  static  and  dynamic 
screening  models  is  small.  As  in  the  case  of  heavy  hole-heavy  hole  scattering,  this  is  due  to  the  flatness 
of  the  heavy  hole  band  and  the  resulting  large  momentum  transfers.  In  the  case  of  electron-polar  optic 
phonon  scattering  the  scattering  rate  with  dynamic  screening  is  much  greater  for  times  after  0  fs.  The 
statically  screened  scattering  rate  is  initially  as  large  as  the  dynamic  one  but  is  suppressed  by  carrier 
screening  as  carriers  are  excited  around  0  fs.  Electron-optical  phonon  scattering  is  only  lightly  screened 
by the dynamic dielectric function because of the optical phonons' large energies.

Fig. 5. Ensemble averaged electron scattering rates. Left: static screening;
right: dynamic screening.


162/SPIE  Vol.  1677(1992 ) 




Fig.  6.  Ensemble  averaged  heavy  hole  scattering  rates.  Left  static 
screening,  right  dynamic  screening. 

4.3 Distribution Function

The  difference  in  scattering  rates  between  the  two  screening  models  has  significant  consequences  for  the 
evolution  of  the  respective  carrier  distribution  functions.  This  is  most  obvious  for  the  electron 
distribution functions which are shown in Figs. 7 and 8. The effects of the increased carrier-carrier
scattering  are  evident  in  the  rapid  washing  out  of  the  initial  excitation  peaks  with  dynamic  screening. 
Also  the  rate  at  which  carriers  transfer  to  the  bottom  of  the  band  is  much  more  rapid.  Clearly  the 
electrons  relax  toward  equilibrium  much  more  rapidly  with  dynamic  screening,  and  carrier-carrier 
scattering  appears  to  play  a  larger  role  even  though  the  dominant  scattering  mechanism  is  polar  optic 
phonon  in  both  cases. 


Fig. 7. Evolution of the electron distribution function simulated using
static screening.

Fig. 8. Evolution of the electron distribution function simulated using
dynamic screening.


4.4  Pulse  Probe  Results 


To examine the effect of the screening model on experimentally measurable parameters, the probe
transmission was calculated in each case for a simulated pulse-probe experiment. These results are shown
in Fig. 9. The results are precisely what would be expected. The dynamically screened curve approaches
equilibrium much faster than the statically screened one. The peak transmission is also lower for the
dynamic case because its larger scattering rates do not allow as many carriers to accumulate in the
optically coupled regions.

Fig. 9. Simulated probe transmission for both the static and dynamic screening models.

In order to obtain a numerical measure of the difference in relaxation rate between the two cases, an
exponential fit was performed on the tail of each curve for times between 200 and 700 fs after the
excitation. The relaxation times obtained were 145 fs for dynamic screening and 205 fs for static
screening, confirming that the dynamically screened carriers relax substantially faster than those that are
statically screened. Such a large difference in relaxation times should be obvious in experiments and
enable the dynamic screening model to be verified. In Fig. 10 the relationship between
relaxation time and excitation pulse intensity is shown, while in Fig. 11 the relaxation time is plotted
versus  the  photon  energy.  In  both  cases  only  the  indicated  parameter  is  varied,  all  others  remain 
unchanged.  It  is  clear  that  dynamic  screening  produces  faster  relaxation  times  for  all  photon  energies  and 
pulse  intensities.  As  a  function  of  pulse  intensity  both  curves  have  the  same  qualitative  behavior  with 
relaxation  times  gradually  increasing  with  increasing  pulse  intensity.  In  contrast  the  two  curves  differ 
qualitatively  as  a  function  of  photon  energy.  The  static  screening  curve  shows  a  definite  step  for  photon 
energies  about  35  meV  above  the  band  edge.  This  is  absent  for  dynamic  screening.  The  location  of  the 
step  in  energy  corresponds  to  the  first  phonon  threshold  for  electrons  excited  from  the  heavy  hole  band. 
For photon energies greater than this, these electrons have sufficient energy to emit optical phonons. The
existence of this step is a clear indication that relaxation is phonon dominated with static screening, while its
absence with dynamic screening points to the increased importance of carrier-carrier scattering. The
large  difference  in  relaxation  times  and  the  presence  or  absence  of  this  step  together  provide  a  means  to 
discriminate  experimentally  between  the  two  screening  models. 
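
The relaxation times quoted above come from exponential fits to the tail of each transmission curve. A minimal Python sketch of such a fit is given below, assuming a single exponential decay to a zero baseline and using a straight-line fit to the logarithm of the signal; the function name, the fitting approach, and the synthetic test curve are assumptions for illustration, since the fitting procedure is not described in further detail here.

```python
import numpy as np

def fit_relaxation_time(delay_fs, signal, t_min=200.0, t_max=700.0):
    """Fit signal ~ A exp(-t/tau) on t_min <= t <= t_max and return tau in fs."""
    delay_fs = np.asarray(delay_fs, dtype=float)
    signal = np.asarray(signal, dtype=float)
    # Keep only the tail window and strictly positive samples (log is taken).
    mask = (delay_fs >= t_min) & (delay_fs <= t_max) & (signal > 0.0)
    slope, _intercept = np.polyfit(delay_fs[mask], np.log(signal[mask]), 1)
    return -1.0 / slope

# Synthetic curve (not simulation output) with a known 145 fs decay time.
delays = np.arange(0.0, 1000.0, 10.0)
synthetic = 0.8 * np.exp(-delays / 145.0)
tau_fit = fit_relaxation_time(delays, synthetic)
```

If the measured transmission settles to a nonzero background, that baseline would have to be subtracted before taking the logarithm.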


Fig. 10. Comparison of the extracted relaxation rates as a function of
excitation pulse intensity.


Fig. 11. Comparison of the extracted relaxation rate as a function of the
energy of the exciting photons.


5.  CONCLUSIONS 

It  seems  clear  from  the  present  results  that  free  carrier  screening  plays  a  crucial  role  in  the  relaxation  of 
carriers  excited  near  the  band  gap  in  femtosecond  optical  probing.  Static  screening  has  been  found  to 
seriously  underestimate  both  the  carrier-carrier  scattering  rates  and  the  polar  optic  phonon  scattering 
rates,  and  lead  to  significantly  longer  relaxation  times  when  compared  to  dynamic  screening.  It  was 
found  that  the  highly  nonequilibrium  state  of  the  distribution  function  results  in  an  unexpectedly  sharp 
resonance  with  the  heavy  holes  in  the  dynamic  dielectric  function  that  greatly  enhances  certain  carrier- 
carrier  scattering  events  for  times  less  than  400  fs  after  the  initial  excitation.  From  this  we  conclude  that 
static  screening  is  inadequate  for  developing  an  accurate  understanding  of  these  types  of  experiments. 
Work  is  currently  underway,  in  collaboration  with  the  experimental  effort  of  C.  Pollock’s  research  group 
at  Cornell,  to  quantitatively  verify  these  results.  It  seems  clear  that  future  work  in  modeling  these 
processes must include dynamic screening for significant progress to be made.


Band Renormalization and Dynamic Screening
in Near Band Gap Femtosecond Optical Probing of InGaAs

J. E. Bair, D. Cohen, J. P. Krusius, C. R. Pollock
Cornell University, School of Electrical Engineering and School of Applied Engineering Physics, Ithaca, New York

The effect of band renormalization and dynamic screening in near band edge
femtosecond optical probing of In0.53Ga0.47As has been investigated.
Measured relaxation times for electrons and holes are on the order of 110 fs.
Simulated results, obtained from an ensemble Monte Carlo formulation, are in
excellent agreement with measured equal pulse correlation data only if both
processes are included. Band renormalization is found to be roughly twice as
important as dynamic screening for these conditions.

The  use  of  femtosecond  lasers  for  probing  carrier  scattering  processes  in  compound 
semiconductors  has  become  common  during  the  1980's.  In  this  type  of  experiment  carriers  are  first 
excited  by  an  initial  optical  excitation  pulse  and  then  probed  by  a  second  pulse  after  a  short  time  delay. 
The  transmission  of  the  second  pulse,  or  the  combined  transmission  of  the  two  pulses,  as  a  function  of 
the  delay  is  determined  by  the  relaxation  of  the  excited  carriers.  Numerous  experiments  of  this  type 
have been performed to date [1-5], and theoretical analyses attempted [6-9], in order to explore the
contributions of the fundamental carrier scattering processes to the measured results. Nearly all of this
effort has been for carriers excited far from the band edge, and thus primarily concerned with intervalley
transfer rates. In this work we investigate the femtosecond carrier relaxation in a largely unexplored
energy range, excitation within 100 meV of the fundamental band edge.

In  the  near  band  gap  regime,  several  processes  traditionally  ignored  are  expected  to  have 
increased  importance.  Significant  among  these  is  band  renormalization.  Despite  the  higher  excitation 
energy,  several  groups  have  observed  behavior  interpreted  as  band  renormalization,  both  on 
femtosecond  [1-3]  and  picosecond  [4]  time  scales.  Also,  it  has  recently  become  clear  that  an  accurate 
treatment  of  carrier-carrier  scattering  including  the  dynamic  free  carrier  dielectric  function  is  essential  in 
analyzing  these  experiments  [5-7].  Several  methods  of  dealing  with  this  problem  have  been  developed 
[2,6-8].  However,  these  methods  either  make  assumptions  about  the  quasi-equilibrium  nature  of  the 
dielectric function or are difficult to generalize to inhomogeneous situations.

In  this  work  we  report  the  first  quantitative  demonstration  of  the  role  of  band  renormalization 
and  dynamic  screening  in  the  initial  femtosecond  relaxation  of  carriers  optically  excited  near  the  band 
gap.  To  this  end  a  new  ensemble  Monte  Carlo  simulation  has  been  developed  which  accounts  for  both 
dynamic  screening  of  all  long  range  carrier  scattering  processes  and  band  renormalization.  The  free 
carrier  dielectric  function  is  obtained  directly  from  the  Lindhard  (RPA)  formula  and  thus  no  assumption 
of  the  quasi-equilibrium  character  of  the  dielectric  function  or  carrier  distributions  is  required.  Both 
dynamic  screening  and  band  renormalization  are  handled  with  simple  extensions  of  standard  Monte 
Carlo techniques and thus are easily generalized to inhomogeneous systems.

The  measurements  were  performed  using  a  tunable  NaCl  color  center  laser  [10]  which  generates 
femtosecond  pulses  with  photon  energies  near  the  band  gap  of  the  chosen  semiconductor 
In0.53Ga0.47As. The measurements described here were performed with 180 fs full-width-half-maximum
(FWHM) sech^2-shaped pulses of 0.787 eV photons. Thus the electron-hole pairs are excited
37  meV  above  the  fundamental  optical  gap  of  0.75  eV.  The  transient  transmission  at  300  K  was 
measured  using  the  equal-pulse  correlation  technique  [11].  The  laser  was  operated  at  a  repetition  rate  of 
164  MHz  while  the  energy  of  each  pulse  was  2.5  x  10^  eV/cm^.  A  typical  experimental  result  from  a 
1.0 μm thick In0.53Ga0.47As film on a transparent InP substrate is shown in Fig. 1.

The measured results have been analyzed using an ensemble Monte Carlo particle simulation
technique, parts of which have been described elsewhere [12]. It includes electrons and holes from the
conduction, heavy hole and light hole bands with provisions for band warping and nonparabolicity. All
important scattering processes are accounted for including carrier-carrier scattering. Materials
parameters for In0.53Ga0.47As have been obtained from Ref. [13]. All carrier processes, including
carrier  scattering,  optical  excitation  and  one  dimensional  carrier  motion,  are  described  self-consistently 
using  the  instantaneous  values  of  the  carrier  distribution  function,  optical  field,  electric  field  and  free 
carrier  dielectric  function.  Contrary  to  what  has  been  done  before,  the  inclusion  of  one  dimensional 
carrier  transport  allows  any  sample  thickness  dependence  to  be  fully  accounted  for. 

Unique  features  of  the  present  simulation  method  are  the  inclusion  of  a  dynamic  free  carrier 
dielectric  function  and  band  renormalization.  Rather  than  introduce  dynamic  screening  through  a 
molecular  dynamics  approach  [7],  as  several  other  groups  have  done,  we  chose  to  develop  a  method 
within  the  standard  Monte  Carlo  framework  that  does  not  compromise  flexibility.  This  method  is  based 
on  a  direct  evaluation  of  the  Lindhard  (RPA)  dielectric  function  given  by 


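\[ \epsilon(\mathbf{q},\omega) = \epsilon_0 + \frac{4\pi e^2}{q^2} \lim_{\alpha\to 0} \sum_{n,\mathbf{k}} \frac{f_n(\mathbf{k}) - f_n(\mathbf{k}+\mathbf{q})}{E_n(\mathbf{k}+\mathbf{q}) - E_n(\mathbf{k}) - \hbar\omega + i\hbar\alpha} \tag{1} \]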


where f_n(k) denotes the carrier distribution function, E_n(k) the carrier energy, and ε_0 the static
dielectric constant. The sum is taken over all crystal momentum states k and all bands n. This equation
is evaluated and tabulated at the beginning of each Monte Carlo time step and the results are used in the
computation of carrier scattering for that time step. To reduce the computational work several
simplifying assumptions are made. The anisotropy of the carrier distribution functions, dielectric
function, and band structure is neglected, and the band structure is taken to be both parabolic and
spherical for the purpose of calculating the free carrier dielectric function. We have also run simulations
using a static screening model for comparison [6].

Band  renormalization  is  included  within  the  “quasi-static”  approximation  developed  by  Haug 
and  Schmitt-Rink  [14].  Within  this  approximation  the  energy  shift  experienced  by  a  state  with  a  crystal 
momentum  k  in  a  single  uncoupled  band  is  given  by 


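$$E_i(\mathbf{k}) = E_i^0(\mathbf{k}) + \Sigma_i^{SX}(\mathbf{k}) + \Sigma_i^{CH}(\mathbf{k}),$$

$$\Sigma_i^{SX}(\mathbf{k}) = -\frac{1}{V} \sum_{\mathbf{k}'} V_s(\mathbf{k}-\mathbf{k}')\, f_i(\mathbf{k}'), \qquad (2)$$

$$\Sigma_i^{CH}(\mathbf{k}) = -\frac{1}{2V} \sum_{\mathbf{k}'} \left[ V(\mathbf{k}')/\varepsilon_0 - V_s(\mathbf{k}') \right],$$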


where $\Sigma^{SX}(\mathbf{k})$ and $\Sigma^{CH}(\mathbf{k})$ are the screened exchange and Coulomb hole contributions to the
electron self-energy, $V_s(\mathbf{k})$ and $V(\mathbf{k})$ the statically screened and unscreened Coulomb potentials
respectively, and $V$ the volume of the crystal. In the present simulations this expression is generalized to
include the effects of coupling between the conduction and the two valence bands, and the overlap
integrals between the Bloch states. This has been implemented assuming a rigid band shift using the
value calculated for the $\Gamma$ point in each band.

This self-consistent Monte Carlo technique was used to simulate the optical probe experiments.
Fig. 1 shows a comparison of both a simulated equal-pulse experiment and a simulated pulse-probe
experiment  with  the  actual  measured  equal  pulse  correlation  data.  Since  the  experiment  does  not 
presently  give  the  absolute  transmission,  all  results  are  presented  in  a  normalized  fashion.  The  fit 
between  the  simulated  and  measured  equal-pulse  curves  is  excellent  for  delays  longer  than  150  fs.  A 
large  coherent  artifact,  evident  for  shorter  delays,  is  as  expected,  since  the  experiment  was  performed 
with  both  pulses  having  the  same  polarization  with  a  280  fs  FWHM  pulse  autocorrelation.  The  simulated 
pulse-probe  curve  also  fits  well  in  the  range  between  150  -  400  fs  but  flattens  out  somewhat  too  rapidly 
for  longer  delays.  This  is  almost  certainly  due  to  subtle  differences  between  the  two  types  of 
experiments.  From  the  excellent  correlation  between  measured  and  simulated  results  we  conclude  that 
the  present  model  provides  a  firm  basis  for  understanding  femtosecond  optical  probing  in  the  near  band 
gap  regime. 

In  order  to  determine  the  role  of  free  carrier  screening  and  band  renormalization  in  near  band 
gap  femtosecond  optical  probing  additional  simulations  of  pulse-probe  experiments  were  performed  in 
which  each  of  the  processes  was  turned  off.  From  the  results  shown  in  Fig.  2  it  is  obvious  that  band 
renormalization  is  responsible  for  a  significantly  reduced  probe  transmission.  It  is  further  clear  that  the 
static  screening  shows  a  slower  recovery  than  dynamic  screening.  To  characterize  the  overall  relaxation 
results  and  to  extract  the  relative  importance  of  the  two  processes,  exponential  fits  were  performed  for 
the range 200-800 fs for each curve in Figs. 1 and 2. The resulting "effective" relaxation times are
given in Table I. It is clear from these results that both processes have strong effects that are
indispensable  in  analyzing  such  experiments.  It  is  somewhat  surprising  that  band  renormalization  is  by 
far  the  most  important  of  these  two  effects  for  these  conditions. 
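
The extraction of an "effective" relaxation time can be illustrated with a minimal sketch. It assumes the normalized transmission change decays to a zero baseline, so that a straight-line fit to the logarithm of the signal recovers the time constant; the function name and the synthetic 350 fs decay are hypothetical and are not data from this work.

```python
import math

def effective_relaxation_time(delays_fs, signal):
    """Fit log(signal) = log(A) - t / tau by linear least squares; return tau in fs."""
    xs = list(delays_fs)
    ys = [math.log(v) for v in signal]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx          # slope of the log-signal versus delay
    return -1.0 / slope        # single-exponential time constant

# Synthetic decay sampled on the 200-800 fs window (hypothetical values).
delays = [200 + 25 * i for i in range(25)]
signal = [0.8 * math.exp(-t / 350.0) for t in delays]
tau = effective_relaxation_time(delays, signal)
```

A signal that relaxes to a non-zero baseline would need the baseline subtracted (or fitted) first; the sketch does not attempt that.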

The significance of band renormalization becomes clearer if the simulated magnitude and
time dependence of the band shifts are examined (Fig. 3). The maximum reduction of the band gap is
approximately  14  meV,  which  is  a  large  fraction  of  the  initial  excess  carrier  energy  of  37  meV.  This 
results  in  dramatic  differences  in  the  form  of  the  excited  carrier  distribution  functions  due  to  the  large 
renormalization  of  the  bands  during  excitation.  There  is  a  small  recovery  in  the  band  gap  for  delays  less 
than  500  fs,  which  results  primarily  from  the  warming  of  the  very  cold  heavy  holes.  This  introduces 
additional  transient  effects  during  the  relaxation  due  to  changes  in  the  effective  photon  energy  of  the 
probe. 


The effect of dynamic screening on the carrier-carrier scattering rates, computed with dynamic
screening and static screening, is shown in Fig. 4. Band renormalization has been included in both
cases. Significant increases in all carrier-carrier scattering rates are observed for dynamic screening. The
electron-electron scattering rate is most affected by screening. This is especially true at early times due
to the highly non-equilibrium nature of the free carrier dielectric function. This has been discussed
briefly in another publication [12] and will be examined in detail in a future publication. The heavy
hole-heavy hole and heavy hole-electron scattering rates are less affected because of the larger momentum
and  smaller  energy  transfers  involved.  The  electron-polar  optic  phonon  scattering  rates  are  also 
increased  by  about  30%  for  dynamic  screening.  These  results  explain  why  dynamic  screening  had  such 
a  large  effect  on  the  observed  carrier  relaxation  processes. 

In conclusion, band renormalization and dynamic screening significantly affect near band gap
femtosecond optical probing. Both effects significantly reduce measured "effective" relaxation times,
with band renormalization being by far the more important of the two. Dynamic screening
markedly increases the important carrier scattering rates, while band renormalization results in changes
in the distribution of carriers within the bands and changes the effective probe energy with respect to the
band edge. By including these processes in our novel Monte Carlo technique we have succeeded in
reproducing measured data. This work demonstrates quantitatively, for the first time, both
the importance and the role of renormalization and dynamic screening in near band gap femtosecond
probing.

Fig. 1. Measured and simulated normalized optical transmission as a function of the delay between
excitation and probe pulses.

Fig. 2. Simulated optical transmission as a function of the delay between excitation and probe pulses
for the pulse-probe configuration. Curves with/without static and dynamic screening and band
renormalization have been labeled accordingly.

Fig.  3.  Simulated  rigid  band  shifts  due  to  band  renormalization  as  a  function  of  probe  delay. 

Fig. 4. Simulated carrier-carrier scattering rates as a function of probe delay. hh-hh, e-e, and e-hh denote
heavy hole-heavy hole, electron-electron, and electron-heavy hole scattering, respectively. Static and
dynamic refer to static and dynamic screening.

Table I. Effective relaxation times calculated by fitting a single exponential to the measured and
simulated transmission over the probe delay range of 200-800 fs.

Abstract

Householder transformations applied from the left are generally used to zero a
contiguous sequence of entries in a column of a matrix A. Our purpose in this paper is to
introduce new row Householder and row hyperbolic Householder transformations which
are also applied from the left, but now zero a contiguous sequence of entries in a row
of A. We then show how these row Householder transformations can be used to design
efficient sliding data window recursive least squares covariance algorithms, which are
based upon rank-$k$ modifications to the inverse Cholesky factor, $R^{-1}$, of the covariance
matrix. The algorithms are rich in matrix-matrix BLAS-3 computations, making them
efficient on vector and parallel architectures. Preliminary numerical experiments are
reported, comparing these row Householder-based rank-$k$ modification schemes with $k$
applications of the classical updating and downdating covariance schemes which use
Givens and hyperbolic rotations.

1 Introduction

In this paper we introduce new row Householder and row hyperbolic Householder
transformations, which zero one row of a matrix at a time when applied from the left. These
transformations are a generalization of an idea first proposed by Bartels and Kaufman [3]
and, as in classical Householder transformations, are rank-1 modifications to the identity
matrix. We will discuss their use in developing efficient algorithms for recursive least squares
problems of the sliding window type.

In [3], Bartels and Kaufman consider schemes for modifying $R$, where $X = QR$ and $X$ is
the given data matrix, subject to rank-2 updates of $X$. To solve these problems efficiently,
they introduce a modified Householder transformation which, when applied from the left,
can zero entries simultaneously in two column vectors. Here we suggest a generalization of
this transformation which, when applied from the left, can eliminate all elements in a row
of a matrix. We then illustrate how these transformations can be very useful in developing
efficient algorithms for modifying $R^{-1}$ (rather than $R$) subject to rank-$k$ changes in $X$.
(Algorithms for modifying $R$ subject to rank-$k$ changes in $X$ were considered in [17] and
analyzed in [6].) We show, in terms of operation counts, that our algorithms are more
efficient for modifying $R^{-1}$ than $k$ applications of the classical algorithms based on Givens
and hyperbolic rotations (see, for example, Pan and Plemmons [14]). Moreover, as Bartels
and Kaufman show for rank-2 modifications, our algorithms are rich in matrix-matrix BLAS-3
computations, making them even more economical on high performance architectures than
$k$ applications of the rank-1 modification schemes.

The outline of this paper is as follows. In Section 2 we introduce the new row Householder
transformations. In Section 3 we show how these transformations can be used to efficiently
update least squares solutions when observations are added and/or deleted from the linear
system. In Section 4 we consider downdating computations. In Section 5 we discuss compact
WY representation of products of row Householder transformations, and in Section 6 we
provide some numerical experiments and some concluding remarks.


2  Row  Householder  Transformations 

In  this  section  we  introduce  a  row  Householder  transformation,  which  is  a  rank  1  modification 
to  the  identity  matrix,  that  when  applied  from  the  left  will  eliminate  k  elements  in  a  row  of 
a  matrix  at  once.  These  row  Householder  transformations  are  still  reflections.  As  pointed 
out  to  the  authors  by  R.  Funderlic  upon  reading  a  preliminary  version  of  the  manuscript, 
row  Householder  reflections  can  be  interpreted  geometrically  in  the  following  way.  Given 
two three-dimensional vectors in three-space, one is finding a reflection that
takes the plane determined by the two vectors into the y-z plane. Moreover, it appears
that Householder reflections of the type described in this paper can be used to eliminate
contiguous sequences of elements in different rows by applying a single reflection from the
left. That possibility is not considered in this paper, but is a topic of future investigation.

We will split our discussion into two subsections. The first will consider row Householder
transformations which are orthogonal, and the second subsection will consider transformations
which are pseudo-orthogonal with respect to a signature matrix $\Phi$.


2.1  Orthogonal  Row  Householder  Transformations 

The row Householder transformation we introduce in this section is a generalization of an
idea first proposed by Bartels and Kaufman [3]. Let $B$ be a $(k+1) \times k$ matrix of the form

$$B = \begin{bmatrix} b^T \\ D \end{bmatrix},$$

where $D$ is nonsingular.

Suppose we wish to eliminate the first row of $B$ (i.e., $b^T$) by premultiplying by an
orthogonal matrix. (Note that this discussion applies, in general, to the case where we want
to eliminate the $j$th row of $B$. In this case we simply permute the $j$th row to the top of $B$.)
In order to accomplish this we construct a Householder transformation

$$P = I - \frac{1}{\lambda}\, p p^T, \qquad (1)$$

where $p \in \mathbb{R}^{k+1}$ and $\lambda = p^T p/2$, such that

$$PB = \begin{bmatrix} 0^T \\ \tilde D \end{bmatrix}. \qquad (2)$$

In order to illustrate how this can be done, let

$$p = \begin{bmatrix} \pi \\ q \end{bmatrix},$$

where $\pi$ is the first component of $p$ and $q$ is the vector consisting of the last $k$ components
of $p$.

If $P$ has the form (1) and satisfies (2), then we obtain the relation

$$\begin{bmatrix} b^T \\ D \end{bmatrix} - \frac{1}{\lambda}\, p\, (\pi b^T + q^T D) = \begin{bmatrix} 0^T \\ \tilde D \end{bmatrix}. \qquad (3)$$

From the first row of the relation (3) we obtain

$$D^T q = \mu b, \qquad (4)$$

where

$$\mu = \lambda/\pi - \pi. \qquad (5)$$

The relation (5) together with $\lambda = p^T p/2$ gives

$$\pi = -\mu \pm \sqrt{\mu^2 + q^T q}. \qquad (6)$$

In order to avoid loss of accuracy in finite precision arithmetic, we
pick the sign so as to maximize the magnitude of $\pi$. Then $\pi$ can be expressed as
follows:

$$\pi = -\mu \left( 1 + \sqrt{1 + z^T z} \right), \qquad (7)$$

where $z = D^{-T} b$. We note that we have one degree of freedom here. Since $\mu$ is a free
variable, we suggest choosing $\mu = 1/\|b\|_2$. If $\|b\|_2 = 0$, we simply set $P = I$.

In  general,  we  have  the  following  algorithm. 

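Algorithm ROWHT

Input: $B^T = [\, b \;\; D^T \,]$, where $D \in \mathbb{R}^{k \times k}$ is nonsingular.

Output: $p \in \mathbb{R}^{k+1}$, where $P = I - \frac{1}{\lambda} p p^T$, $\lambda = p^T p/2$, has the
property that the first row of $PB$ is all zeros.

if $\|b\|_2 = 0$
    $p = 0$, $P = I$
else
    $\mu = 1/\|b\|_2$
    solve $D^T q = \mu b$
    $\pi = -\mu - \sqrt{\mu^2 + q^T q}$
    $p^T = [\, \pi \;\; q^T \,]$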


A  Householder  transformation  (computed  by  algorithm  ROWHT)  which  zeros  elements 
in  a  row  vector  will  be  called  a  row  Householder  transformation  to  differentiate  it  from  the 
classical  column  Householder  transformation  which  zeros  elements  in  a  column  vector. 
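
The steps of Algorithm ROWHT can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `rowht`, the dense solve via `numpy.linalg.solve` (rather than a maintained QR factorization of $D$), and the random test data are choices made here, not part of the original algorithm statement.

```python
import numpy as np

def rowht(b, D):
    """Return (p, lam) such that P = I - p p^T / lam zeros the first row of [b^T; D]."""
    b = np.asarray(b, dtype=float)
    D = np.asarray(D, dtype=float)
    norm_b = np.linalg.norm(b)
    if norm_b == 0.0:
        # Nothing to eliminate: p = 0 gives P = I (lam = 1 is a dummy value).
        return np.zeros(b.size + 1), 1.0
    mu = 1.0 / norm_b                      # suggested choice of the free parameter
    q = np.linalg.solve(D.T, mu * b)       # solve D^T q = mu b, equation (4)
    pi = -mu - np.sqrt(mu * mu + q @ q)    # sign chosen to maximize |pi|
    p = np.concatenate(([pi], q))
    lam = 0.5 * (p @ p)
    return p, lam

# Small check on hypothetical data; the diagonal shift keeps D safely nonsingular.
rng = np.random.default_rng(0)
k = 4
b = rng.standard_normal(k)
D = rng.standard_normal((k, k)) + 5.0 * np.eye(k)
p, lam = rowht(b, D)
P = np.eye(k + 1) - np.outer(p, p) / lam
PB = P @ np.vstack([b, D])
```

Forming $P$ explicitly is done here only to verify the result; in practice one would apply the rank-1 modification directly.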

The  algorithm  ROWHT  will  have  good  numerical  properties  as  long  as  (4)  is  solved 
by  a  numerically  stable  method.  This  is  made  precise  by  the  following  lemma. 


Lemma 1 Let $\epsilon$ be the machine relative precision and let $q$ satisfy

$$(D^T + \delta D^T)\, q = \mu b \qquad (8)$$

with $\|\delta D^T\| = O(\epsilon \|D\|)$. Further, let $\pi = -\mu - \sqrt{\mu^2 + q^T q}$, $p^T = [\, \pi \;\; q^T \,]$, $\lambda = p^T p/2$ and

$$\hat P = I - \frac{1}{\lambda}\, p p^T.$$

Then there exists a perturbation $\delta B$ of the matrix $B$ such that

$$\hat P (B + \delta B) = \begin{bmatrix} 0^T \\ \hat D \end{bmatrix} \qquad (9)$$

and $\|\delta B\| = O(\epsilon \|B\|)$.

Proof: The proof is straightforward and hence omitted.

REMARK: It is important to note that $\hat P$ does not have to be close to $P$ defined
by (1) and (2). Similarly, $\hat D$ does not have to be close to $\tilde D$ in (2). The situation
here is analogous to that of the QR decomposition of a perturbed matrix $X + \delta X$,
where the factors of the perturbed matrix can differ from the factors of the original
matrix $X$ by as much as $O(\mathrm{cond}(X)\|\delta X\|)$; see Stewart [19]. However, what is essential
from the numerical analysis point of view is that $\hat P$ is orthogonal and zeros the
first row of a nearby matrix $B + \delta B$.

Note that finding $p$ requires solving a $k \times k$ system of linear equations, which in general
amounts to $O(k^3)$ operations. However, if the QR decomposition of $D$ is available, the cost
of finding $p$ is decreased to $O(k^2)$ operations.

In the sequel we will encounter the problem of annihilating $r$ rows, $r > 1$, of a $(k+r) \times k$
matrix by finding an orthogonal $P$ such that

$$P \begin{bmatrix} b_1^T \\ \vdots \\ b_r^T \\ D \end{bmatrix} = \begin{bmatrix} 0^T \\ \vdots \\ 0^T \\ \tilde D \end{bmatrix}.$$

Such a $P$ can be constructed as a product of $r$ row Householder transformations $P_i$,
$P = P_r P_{r-1} \cdots P_1$, with $P_i$ annihilating row $i$ of the matrix. As the major cost of determining
such transformations is in solving systems of linear equations, it is worthwhile to attempt
to decrease this cost. This can be done by maintaining and updating the QR decomposition
of the bottom $k \times k$ submatrix. For the sake of illustration we show the first step of this
process. Let

$$B = \begin{bmatrix} b_1^T \\ \vdots \\ b_r^T \\ D \end{bmatrix}$$

and let

$$D_0 = D = Q_0 R_0$$

be the QR decomposition of $D$, which is assumed to be given. Let $P_1 = I - p_1 p_1^T/\lambda_1$ be a
modified Householder transformation such that

$$P_1 \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_r^T \\ D_0 \end{bmatrix} = \begin{bmatrix} 0^T \\ b_2^T \\ \vdots \\ b_r^T \\ D_1 \end{bmatrix}.$$

Then the form of $P_1$ implies that

$$D_1 = D_0 - \frac{1}{\pi_1}\, q_1 b_1^T = Q_0 R_0 - \frac{1}{\pi_1}\, q_1 b_1^T, \qquad (10)$$

where $\pi_1$ is the first component of $p_1$ and $q_1$ consists of its last $k$ components.

Thus the QR factorization of $D_1$ can be obtained at the cost of $13k^2$ multiplications (and
$13k^2$ additions) by updating the QR factorization of $D_0$ after a rank-1 change of $D_0$ [11].
This has to be repeated $r - 1$ times, resulting in a total of $O(rk^2)$ operations for the overall
process of computing all transformations $P_i$, $i = 1, \ldots, r$.


2.2  Row  Hyperbolic  Householder  Transformations 

Let $\Phi = \mathrm{diag}(\pm 1)$ be a $(k+1) \times (k+1)$ diagonal matrix, and suppose $p$ is a vector of length
$k+1$ with $p^T \Phi p > 0$. Then a hyperbolic Householder transformation is a matrix of the form

$$P = \Phi - \frac{1}{\lambda}\, p p^T, \qquad (11)$$

where $\lambda = \tfrac{1}{2} p^T \Phi p$. The matrix $P$ is a pseudo-orthogonal matrix with respect to $\Phi$, i.e.,

$$P^T \Phi P = \Phi.$$

Hyperbolic Householder transformations are typically used to introduce zeros into a
column of a matrix, and were studied in detail by Rader and Steinhardt [17]. Here we introduce
a row hyperbolic Householder transformation which eliminates entries in a row of a matrix.
The discussion in this subsection is similar to that given in §2.1 for the (orthogonal) row
Householder transformations.

Let $B$ be a $(k+1) \times k$ matrix of the form

$$B = \begin{bmatrix} b^T \\ D \end{bmatrix},$$

where $D$ is nonsingular. Suppose we wish to eliminate the first row of $B$ using a
transformation of the form (11). As in §2.1 this can be illustrated as follows. Let

$$p = \begin{bmatrix} \pi \\ q \end{bmatrix},$$

where $\pi$ is the first component of $p$ and $q$ is a vector consisting of the last $k$ components of
$p$. Now suppose

$$P \begin{bmatrix} b^T \\ D \end{bmatrix} = \begin{bmatrix} 0^T \\ \tilde D \end{bmatrix}.$$

Then, assuming $P$ has the form (11), where

$$\Phi = \begin{bmatrix} \phi_1 & 0 \\ 0 & \hat\Phi \end{bmatrix},$$

we have

$$\begin{bmatrix} \phi_1 & 0 \\ 0 & \hat\Phi \end{bmatrix} \begin{bmatrix} b^T \\ D \end{bmatrix} - \frac{1}{\lambda} \begin{bmatrix} \pi \\ q \end{bmatrix} (\pi b^T + q^T D) = \begin{bmatrix} 0^T \\ \tilde D \end{bmatrix}.$$

Thus, we obtain

$$D^T q = \mu b, \qquad (12)$$

where $\mu = \lambda\phi_1/\pi - \pi$.

Now, if we fix $\mu$, then we can solve (12) for $q$. Once $q$ is known, then, using
$\mu = \lambda\phi_1/\pi - \pi$, we have

$$\pi^2 + \pi\mu - \phi_1 \lambda = 0.$$

Thus, since

$$\lambda = \tfrac{1}{2}\, p^T \Phi p = \tfrac{1}{2} (\phi_1 \pi^2 + q^T \hat\Phi q),$$

and since $\phi_1^2 = 1$, we obtain the relation

$$\pi^2 + 2\pi\mu - \phi_1 q^T \hat\Phi q = 0.$$

Thus, if

$$\mu^2 + \phi_1 q^T \hat\Phi q > 0, \qquad (13)$$

we have

$$\pi = -\mu - \mathrm{sgn}(\mu) \sqrt{\mu^2 + \phi_1 q^T \hat\Phi q}.$$

We point out that the requirement $\mu^2 + \phi_1 q^T \hat\Phi q > 0$ is satisfied for our problem of inverse
matrix modifications. This will be discussed in further detail in Section 4.

As for the (orthogonal) row Householder transformations, we suggest choosing
$\mu = 1/\|b\|_2$, and $P = \Phi$ if $\|b\|_2 = 0$. The following algorithm summarizes the above
discussion.


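Algorithm ROWHHT

Input: $B^T = [\, b \;\; D^T \,]$, where $D \in \mathbb{R}^{k \times k}$ is nonsingular.
Output: $p \in \mathbb{R}^{k+1}$, where $P = \Phi - \frac{1}{\lambda} p p^T$, $\lambda = p^T \Phi p/2$, has the
property that the first row of $PB$ is all zeros.

if $\|b\|_2 = 0$
    $p = 0$, $P = \Phi$
else
    $\mu = 1/\|b\|_2$
    solve $D^T q = \mu b$
    $\pi = -\mu - \sqrt{\mu^2 + \phi_1 q^T \hat\Phi q}$
    $p^T = [\, \pi \;\; q^T \,]$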

Similarly to the orthogonal case, a hyperbolic Householder transformation (computed
by Algorithm ROWHHT) which zeros elements in a row vector will be called a row hyperbolic
Householder transformation. If the QR decomposition of $D$ is available, the cost of finding
$p$ is of the order of $O(k^2)$ operations. For the problem of annihilating $r$ rows, $r > 1$, of a
$(k+r) \times k$ matrix $B$, that cost is of the order of $O(rk^2)$ operations (see the discussion at the
end of Section 2.1).
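
A hyperbolic counterpart of the row transformation can be sketched along the same lines. The sketch below is an illustration only: the function name `rowhht`, the convention of passing the diagonal of the signature matrix as a vector, the explicit error raised when condition (13) fails, and the small test matrix are all assumptions made here.

```python
import numpy as np

def rowhht(b, D, phi):
    """Return (p, lam) such that P = Phi - p p^T / lam zeros the first row of [b^T; D].

    phi holds the k+1 diagonal entries (+1 or -1) of the signature matrix Phi.
    """
    b = np.asarray(b, dtype=float)
    D = np.asarray(D, dtype=float)
    phi = np.asarray(phi, dtype=float)
    phi1, phi_hat = phi[0], phi[1:]
    norm_b = np.linalg.norm(b)
    if norm_b == 0.0:
        # Nothing to eliminate: p = 0 gives P = Phi (lam = 1 is a dummy value).
        return np.zeros(b.size + 1), 1.0
    mu = 1.0 / norm_b
    q = np.linalg.solve(D.T, mu * b)                 # equation (12)
    disc = mu * mu + phi1 * (q @ (phi_hat * q))      # left-hand side of (13)
    if disc <= 0.0:
        raise ValueError("condition (13) is violated")
    pi = -mu - np.sqrt(disc)
    p = np.concatenate(([pi], q))
    lam = 0.5 * (p @ (phi * p))
    return p, lam

# Hypothetical example with the signature used for downdating: +1 for the
# row to be eliminated, -1 for the remaining k rows.
phi = np.array([1.0, -1.0, -1.0, -1.0])
b = np.array([1.0, -2.0, 0.5])
D = np.array([[4.0, 1.0, 0.0], [0.0, 5.0, 1.0], [1.0, 0.0, 6.0]])
p, lam = rowhht(b, D, phi)
Phi = np.diag(phi)
P = Phi - np.outer(p, p) / lam
PB = P @ np.vstack([b, D])
```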

3  Modifying  the  Inverse  Cholesky  Factor 

Let $X$ be a real $m \times n$ matrix with full column rank, and let $s$ be a real vector of length $m$.
Consider the least squares problem

$$\min_w \| s - Xw \|_2. \qquad (14)$$

It is well known (see, for instance, [12]) that this problem can be solved by finding the QR
factorization of $X$. Specifically, let $X = QR$, where $Q$ is an $m \times n$ matrix with orthonormal
columns, and $R$ is an $n \times n$ upper triangular matrix. Then the solution to (14) is given by

$$w = R^{-1} Q^T s.$$
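
As a small numerical illustration of this formula (a sketch on random, hypothetical data, using NumPy's reduced QR factorization and a general dense solve in place of a dedicated triangular solve):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
X = rng.standard_normal((m, n))   # data matrix, full column rank with probability 1
s = rng.standard_normal(m)        # right-hand side

Q, R = np.linalg.qr(X)            # reduced QR: Q is m x n, R is n x n upper triangular
w = np.linalg.solve(R, Q.T @ s)   # w = R^{-1} Q^T s

# Reference solution from NumPy's least squares driver.
w_ref = np.linalg.lstsq(X, s, rcond=None)[0]
```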

In many applications, such as signal processing, it is often required to recalculate $w$
when successive observations (i.e., equations) are added to and/or deleted from (14). In
this section we consider updating the solution $w$ to $\tilde w$ when $k$ new observations are added
to the system, and downdating $w$ to $\tilde w$ when $k$ observations are removed from the system.
This method is called recursive least squares (RLS), and can be reformulated as a $k$-step
process of $k$ successive modifications of $w$ after addition/deletion of a single observation.
Such rank-1 modifications are most often realized by plane rotations and have been studied
by many authors. In this paper we treat multiple addition/deletion of observations as a block
process in a manner analogous to that presented in [17]. However, unlike in [17], where
the upper triangular factor in the QR decomposition of $X$ was modified, this paper proposes
algorithms for direct modification of the inverse of the triangular factor. This procedure is
called the covariance method in RLS computations. We will show how the row Householder
transformations described in Section 2 can be used to design efficient sliding data window
RLS covariance algorithms.

3.1  Inverse  Updating 

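Let $X = QR$ be the QR factorization of $X$. Suppose $k$ new observations

$$[\, Y^T \;\; u \,],$$

where $Y^T \in \mathbb{R}^{k \times n}$ and $u \in \mathbb{R}^k$, are added to the data defining the least squares
problem (14). We first show how $R^{-1}$ can be updated to $\tilde R^{-1}$, where

$$\tilde X = \begin{bmatrix} X \\ Y^T \end{bmatrix} = \tilde Q \tilde R$$

is the QR factorization of $\tilde X$. We then show how the solution $w$ of (14) can be
updated to the solution $\tilde w$ of

$$\min_{\tilde w} \left\| \begin{bmatrix} s \\ u \end{bmatrix} - \begin{bmatrix} X \\ Y^T \end{bmatrix} \tilde w \right\|_2. \qquad (15)$$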
It is well known that there exists an orthogonal matrix $H$ such that

$$H \begin{bmatrix} R \\ Y^T \end{bmatrix} = \begin{bmatrix} \tilde R \\ 0 \end{bmatrix}. \qquad (16)$$

The matrix $H$ can be constructed as a product of $(n+k) \times (n+k)$ Householder
transformations $H_i$, $i = 1, \ldots, n$, such that $H_i$ annihilates subdiagonal elements in column $i$,
$i = 1, \ldots, n$, of the matrix

$$H_{i-1} \cdots H_2 H_1 \begin{bmatrix} R \\ Y^T \end{bmatrix}.$$

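It is known that if $H$ is orthogonal and satisfies (16), then $H$ also updates the inverse of $R$,
namely

$$H \begin{bmatrix} R^{-T} \\ 0 \end{bmatrix} = \begin{bmatrix} \tilde R^{-T} \\ E^T \end{bmatrix}, \qquad (17)$$

where $E$ is an $n \times k$ matrix. To see this, write $[\, U \;\; E \,] = [\, R^{-1} \;\; 0 \,] H^T$ and note that

$$I = [\, R^{-1} \;\; 0 \,] \begin{bmatrix} R \\ Y^T \end{bmatrix} = [\, R^{-1} \;\; 0 \,]\, H^T H \begin{bmatrix} R \\ Y^T \end{bmatrix} = [\, U \;\; E \,] \begin{bmatrix} \tilde R \\ 0 \end{bmatrix}.$$

Thus $U = \tilde R^{-1}$.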

We would like to be able to work with $R^{-T}$, and not with $R$ explicitly, since the triangular
solves needed in solving systems associated with $R$ can then be replaced by matrix-vector
or matrix-matrix multiplications. The following lemma shows how we can construct an
orthogonal matrix $H$ satisfying (16) and avoid using $R$ explicitly.

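Lemma 2 Let $V = -R^{-T} Y$, and let $H$ be an orthogonal matrix such that

$$H \begin{bmatrix} V \\ I_k \end{bmatrix} = \begin{bmatrix} 0 \\ D \end{bmatrix}, \qquad (18)$$

where $I_k$ is the $k \times k$ identity matrix and $D$ is a $k \times k$ matrix. Then

$$H \begin{bmatrix} R \\ Y^T \end{bmatrix} = \begin{bmatrix} U \\ 0 \end{bmatrix}. \qquad (19)$$

If $U$ is upper triangular then $U = \tilde R$ and

$$H \begin{bmatrix} R^{-T} \\ 0 \end{bmatrix} = \begin{bmatrix} \tilde R^{-T} \\ E^T \end{bmatrix}, \qquad (20)$$

where $E = R^{-1} V D^{-1}$.

Proof: The proof for $k = 1$ can be found in [14]. For $k > 1$ one proceeds as follows. Let

$$H \begin{bmatrix} V & R \\ I_k & Y^T \end{bmatrix} = \begin{bmatrix} 0 & U \\ D & \tilde Y^T \end{bmatrix}. \qquad (21)$$

From the orthogonality of $H$, the definition of $V$ and the fact that $D$ is nonsingular, it follows
that $\tilde Y = 0$ and hence

$$R^T R + Y Y^T = U^T U.$$

Thus if $U$ is upper triangular with positive diagonal elements then $U = \tilde R$. From (17), for
the inverse we have an analogous relation, namely

$$H \begin{bmatrix} V & R^{-T} \\ I_k & 0 \end{bmatrix} = \begin{bmatrix} 0 & \tilde R^{-T} \\ D & E^T \end{bmatrix}. \qquad (22)$$

Now (22) implies

$$\begin{bmatrix} V^T V + I & V^T R^{-T} \\ R^{-1} V & R^{-1} R^{-T} \end{bmatrix} = \begin{bmatrix} D^T D & D^T E^T \\ E D & \tilde R^{-1} \tilde R^{-T} + E E^T \end{bmatrix},$$

from which one obtains that

$$E = R^{-1} V D^{-1}.$$

This completes the proof. □

The relation (22) shows that it is possible to work with the inverses only. The condition
that has to be satisfied is that application of the transformation $H$ in (22) has to result in
a lower triangular matrix $U^{-T}$.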

We now show how to construct an orthogonal matrix $H$ satisfying (18) and (19). To do
this, we will use the row Householder transformation. More precisely, suppose that we have
constructed row Householder transformations $P_1, P_2, \ldots, P_j$ such that

$$P_j \cdots P_2 P_1 \begin{bmatrix} V \\ I \end{bmatrix} = \begin{bmatrix} 0_j \\ V_j \\ D_j \end{bmatrix},$$

where $0_j$ denotes the $j \times k$ matrix of all zeros, and $V_j \in \mathbb{R}^{(n-j) \times k}$ and $D_j \in \mathbb{R}^{k \times k}$. Then
using Algorithm ROWHT, we find $\hat p_j^T = [\, \pi_j \;\; q_j^T \,]$ so that

$$\hat P_{j+1} \begin{bmatrix} v_j^T \\ D_j \end{bmatrix} = \begin{bmatrix} 0^T \\ D_{j+1} \end{bmatrix},$$

where $v_j^T$ is the first row of $V_j$ and $\hat P_{j+1} = I - \frac{1}{\lambda_j} \hat p_j \hat p_j^T$. Then $P_{j+1}$ is simply given by

$$P_{j+1} = I - \frac{1}{\lambda_j}\, p_j p_j^T,$$

where $p_j^T = [\, 0, \ldots, 0, \pi_j, 0, \ldots, 0, q_j^T \,]$ (the $j$-th component of $p_j$ is $\pi_j$, the last $k$ components
of $p_j$ form the vector $q_j$, and all other components are zeros). It is now easy to see that
$P = P_n \cdots P_2 P_1$ satisfies (18) and hence

$$P_n \cdots P_2 P_1 \begin{bmatrix} R^{-T} \\ 0 \end{bmatrix} = \begin{bmatrix} \tilde R^{-T} \\ E^T \end{bmatrix},$$

as $\tilde R^{-T}$ is by construction lower triangular and hence the desired updated factor.

Now that we have a scheme for updating $R^{-T}$, we need to use this information to
efficiently update the least squares solution $w$ to $\tilde w$. The following theorem shows how this can
be done.

Theorem 1 Let $H$ satisfy (22), that is

$$H \begin{bmatrix} V & R^{-T} \\ I & 0 \end{bmatrix} = \begin{bmatrix} 0 & \tilde R^{-T} \\ D & E^T \end{bmatrix}.$$

If $w$ is the solution to (??), then the solution to (15) is given by

$$\tilde w = w - E D^{-T} (u - Y^T w).$$

Moreover $E = R^{-1} V D^{-1}$.

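Proof: Let

$$X = Q \begin{bmatrix} R \\ 0 \end{bmatrix}, \qquad Q^T s = \begin{bmatrix} s_1 \\ s_2 \end{bmatrix},$$

where $s_1 \in \mathbb{R}^n$ and $s_2 \in \mathbb{R}^{m-n}$. Then (15) can be rewritten as

$$\min \left\| \begin{bmatrix} s \\ u \end{bmatrix} - \begin{bmatrix} X \\ Y^T \end{bmatrix} \tilde w \right\|_2 = \min \left\| \begin{bmatrix} s_1 \\ s_2 \\ u \end{bmatrix} - \begin{bmatrix} R \\ 0 \\ Y^T \end{bmatrix} \tilde w \right\|_2. \qquad (23)$$

Furthermore, let

$$H \begin{bmatrix} s_1 \\ u \end{bmatrix} = \begin{bmatrix} \tilde s_1 \\ \tilde u \end{bmatrix}. \qquad (24)$$

Note that $w = R^{-1} s_1$ and hence from the definition of $V$ we have the relation

$$\begin{bmatrix} u - Y^T w \\ w \end{bmatrix} = \begin{bmatrix} V^T s_1 + u \\ R^{-1} s_1 \end{bmatrix} = \begin{bmatrix} V^T & I \\ R^{-1} & 0 \end{bmatrix} \begin{bmatrix} s_1 \\ u \end{bmatrix}. \qquad (25)$$

Using the definition (24) and the assumption of the theorem, the right hand side
of (25) simplifies as follows:

$$\begin{bmatrix} V^T & I \\ R^{-1} & 0 \end{bmatrix} H^T H \begin{bmatrix} s_1 \\ u \end{bmatrix} = \begin{bmatrix} 0 & D^T \\ \tilde R^{-1} & E \end{bmatrix} \begin{bmatrix} \tilde s_1 \\ \tilde u \end{bmatrix} = \begin{bmatrix} D^T \tilde u \\ \tilde R^{-1} \tilde s_1 + E \tilde u \end{bmatrix} = \begin{bmatrix} D^T \tilde u \\ \tilde w + E \tilde u \end{bmatrix}, \qquad (26)$$

as from (23) we have that $\tilde w = \tilde R^{-1} \tilde s_1$. Now the theorem follows from the fact
that the left hand side of (25) and the right hand side of (26) are equal.

□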

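Thus, summarizing the results of this section, we obtain the following algorithm.

Algorithm IUP-$k$

Given: $R^{-T}$ and $w$, where $X = QR$ and $w$ solves (??).
Input: New set of $k$ observations $[\, Y^T \;\; u \,]$.

Then this algorithm computes $\tilde R^{-T}$ and $\tilde w$, where

$$\begin{bmatrix} X \\ Y^T \end{bmatrix} = \tilde Q \tilde R$$

and $\tilde w$ solves (15).

1. Compute $V = -R^{-T} Y$.
   Cost $kn^2/2$ multiplications.

2. Find $H = P_n \cdots P_2 P_1$, where $P_i$ are row Householder transformations, such that

   $$H \begin{bmatrix} V \\ I \end{bmatrix} = \begin{bmatrix} 0 \\ D \end{bmatrix}.$$

   Cost $13 k^2 n$ multiplications.

3. Update $R^{-T}$ to $\tilde R^{-T}$:

   $$H \begin{bmatrix} R^{-T} \\ 0 \end{bmatrix} = \begin{bmatrix} \tilde R^{-T} \\ E^T \end{bmatrix}.$$

   Cost $kn^2$ multiplications.

4. Update $w$ to $\tilde w$:

   $$\tilde w = w - E D^{-T} (u - Y^T w).$$

   Cost $O(k^2) + 2kn$ multiplications (as from (10) the QR decomposition
   of $D$ is already available from step 2).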


The total cost for Algorithm IUP-$k$ is $\tfrac{3}{2} kn^2 + 13 k^2 n + 2kn + O(k^2)$ multiplications.
We note that the straightforward implementation of the rank-1 method of Pan and Plemmons
[14] would require $\tfrac{5}{2} kn^2 + O(kn)$ multiplications. Thus, roughly speaking,
Algorithm IUP-$k$ requires fewer multiplications than the method described in [14]
when $n > 13k$. The major advantage of Algorithm IUP-$k$ is that it is rich in BLAS
level 2 and BLAS level 3 operations, which may lead to a more efficient implementation
on parallel computers.
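
The four steps of Algorithm IUP-$k$ can be sketched as follows. This is an illustrative NumPy version under several assumptions made here: the helper names `rowht` and `iup_k` are hypothetical, each row Householder vector is obtained with a dense $O(k^3)$ solve instead of the updated QR factorization of $D$ mentioned in step 4, and the test data are random. The computed triangular factor may differ from the positive-diagonal Cholesky factor by row signs, so the check compares Gram matrices rather than the factors themselves.

```python
import numpy as np

def rowht(b, D):
    """Row Householder data (pi, q, lam) for eliminating b^T against D."""
    norm_b = np.linalg.norm(b)
    if norm_b == 0.0:
        return 0.0, np.zeros_like(b), 1.0     # identity transformation
    mu = 1.0 / norm_b
    q = np.linalg.solve(D.T, mu * b)
    pi = -mu - np.sqrt(mu * mu + q @ q)
    return pi, q, 0.5 * (pi * pi + q @ q)

def iup_k(Rinv_T, w, Y, u):
    """Update R^{-T} and w after appending the k observations [Y^T u]."""
    n, k = Y.shape
    M = np.vstack([-Rinv_T @ Y, np.eye(k)])     # step 1: [V; I_k]
    G = np.vstack([Rinv_T, np.zeros((k, n))])   # [R^{-T}; 0]
    for j in range(n):                          # steps 2 and 3, one row at a time
        pi, q, lam = rowht(M[j], M[n:])
        p = np.zeros(n + k)
        p[j] = pi
        p[n:] = q
        M -= np.outer(p, p @ M) / lam           # apply P = I - p p^T / lam
        G -= np.outer(p, p @ G) / lam
    D, E = M[n:], G[n:].T
    w_new = w - E @ np.linalg.solve(D.T, u - Y.T @ w)   # step 4
    return G[:n], w_new

# Hypothetical test problem.
rng = np.random.default_rng(2)
m, n, k = 10, 4, 3
X = rng.standard_normal((m, n)); s = rng.standard_normal(m)
Y = rng.standard_normal((n, k)); u = rng.standard_normal(k)
Q, R = np.linalg.qr(X)
w = np.linalg.solve(R, Q.T @ s)
Rt_inv_T, w_new = iup_k(np.linalg.inv(R).T, w, Y, u)

X_aug = np.vstack([X, Y.T]); s_aug = np.concatenate([s, u])
w_ref = np.linalg.lstsq(X_aug, s_aug, rcond=None)[0]
```

Applying the transformations as rank-1 updates one row at a time keeps the sketch short; a blocked variant would be needed to obtain the BLAS level 3 behaviour discussed above.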


3.2  Inverse  Downdating 

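Suppose $k$ observations

$$[\, Z^T \;\; d \,],$$

where $Z^T \in \mathbb{R}^{k \times n}$ and $d \in \mathbb{R}^k$, are to be deleted from the data describing (14).
We now describe a method for downdating $w$ to the solution $\tilde w$ of

$$\min_{\tilde w} \| \tilde s - \tilde X \tilde w \|_2. \qquad (27)$$

We first show how $R^{-1}$ can be updated to $\tilde R^{-1}$, where

$$\tilde X = \tilde Q \tilde R, \qquad X = \begin{bmatrix} \tilde X \\ Z^T \end{bmatrix},$$

is the QR factorization of $\tilde X$.

Note that as long as $\tilde X$ is full rank then

$$R^T R - Z Z^T > 0. \qquad (28)$$

The Cholesky factor $\tilde R$ of $\tilde X$ satisfies

$$\tilde R^T \tilde R = R^T R - Z Z^T.$$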

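In [17] it is shown that there exists a pseudo-orthogonal transformation $H$ with
respect to the signature

$$\Phi = \begin{bmatrix} I_n & 0 \\ 0 & -I_k \end{bmatrix}$$

such that

$$H \begin{bmatrix} R \\ Z^T \end{bmatrix} = \begin{bmatrix} \tilde R \\ 0 \end{bmatrix}. \qquad (29)$$

The matrix $H$ can be constructed as a product of $(n+k) \times (n+k)$ hyperbolic Householder
transformations $H_i$, $i = 1, \ldots, n$, such that $H_i$ annihilates subdiagonal elements in
column $i$, $i = 1, \ldots, n$, of the matrix

$$H_{i-1} \cdots H_2 H_1 \begin{bmatrix} R \\ Z^T \end{bmatrix}.$$

Similarly to the case of orthogonal transformations, if the hyperbolic $H$ satisfies (29) then $H$ also
downdates the inverse of $R$. To see this, note that

$$I = [\, R^{-1} \;\; 0 \,]\, \Phi \begin{bmatrix} R \\ Z^T \end{bmatrix} = [\, R^{-1} \;\; 0 \,]\, H^T \Phi H \begin{bmatrix} R \\ Z^T \end{bmatrix} = [\, U \;\; F \,] \begin{bmatrix} \tilde R \\ 0 \end{bmatrix}.$$

Thus $U = \tilde R^{-1}$, and

$$H \begin{bmatrix} R^{-T} \\ 0 \end{bmatrix} = \begin{bmatrix} \tilde R^{-T} \\ F^T \end{bmatrix}. \qquad (30)$$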


We  would  like  to  work  with  the  inverses  directly  and  hence  need  a  way  for  constructing 
H  satisfying  (29)  without  any  explicit  reference  to  R.  The  following  lemma  provides  means 
just  for  that. 

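Lemma 3 Assume $R^T R - Z Z^T > 0$. Let $V = R^{-T} Z$, and let $H$ be a hyperbolic (with respect
to $\Phi$) transformation such that

$$ H \begin{bmatrix} V \\ I_k \end{bmatrix} = \begin{bmatrix} 0 \\ D \end{bmatrix}, \tag{31} $$

where $I_k$ is the $k \times k$ identity matrix and $D$ is a $k \times k$ matrix. Then

$$ H \begin{bmatrix} R \\ Z^T \end{bmatrix} = \begin{bmatrix} U \\ 0 \end{bmatrix}. \tag{32} $$

If $U$ is upper triangular, then $U = \tilde R$ and

$$ H \begin{bmatrix} R^{-T} \\ 0^T \end{bmatrix} = \begin{bmatrix} \tilde R^{-T} \\ F^T \end{bmatrix}, \tag{33} $$

where $F = -R^{-1} V D^{-1}$.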

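Proof: The proof for $k = 1$ can be found in [14]. For $k > 1$ one proceeds as follows. Let

$$ H \begin{bmatrix} V & R \\ I_k & Z^T \end{bmatrix} = \begin{bmatrix} 0 & U \\ D & \hat Z^T \end{bmatrix}. \tag{34} $$

From the definition of $V$ and the fact that $H$ is hyperbolic (with respect to $\Phi$) we obtain
that

$$ \begin{bmatrix} V^T V - I_k & 0 \\ 0 & R^T R - Z Z^T \end{bmatrix}
 = \begin{bmatrix} -D^T D & -D^T \hat Z^T \\ -\hat Z D & U^T U - \hat Z \hat Z^T \end{bmatrix}. \tag{35} $$

Comparing upper left entries on both sides we get

$$ -D^T D = V^T V - I_k = Z^T R^{-1} R^{-T} Z - I_k. $$

Now, as $R^T R - Z Z^T > 0$ then $I_k - Z^T R^{-1} R^{-T} Z > 0$ and hence $D$ is nonsingular.

From (35) and the nonsingularity of $D$ it follows that $\hat Z = 0$ and hence

$$ R^T R - Z Z^T = U^T U. $$

Thus if $U$ is upper triangular (with positive diagonal elements) then $U = \tilde R$. From (30), for
the inverse we have an analogous relation, namely

$$ H \begin{bmatrix} V & R^{-T} \\ I_k & 0 \end{bmatrix} = \begin{bmatrix} 0 & \tilde R^{-T} \\ D & F^T \end{bmatrix}. \tag{36} $$

Now (36) implies

$$ \begin{bmatrix} V^T V - I_k & V^T R^{-T} \\ R^{-1} V & R^{-1} R^{-T} \end{bmatrix}
 = \begin{bmatrix} -D^T D & -D^T F^T \\ -F D & \tilde R^{-1} \tilde R^{-T} - F F^T \end{bmatrix}, $$

from which one obtains that

$$ F = -R^{-1} V D^{-1}. $$

This completes the proof.

□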

The relation (36) shows that, as for updating the inverse, it is also possible to downdate
the inverse directly. The condition that has to be satisfied is that application of $H$ in (36)
has to result in a lower triangular matrix $U^{-T}$.

The construction of $H$ satisfying (36) is analogous to that described at the end of Section
3.2. Now however $H$ is constructed as a product of row hyperbolic Householder transformations.
The only thing that needs to be verified is that the condition (13) is always satisfied
for each factor that makes up $H$.

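Suppose that we have constructed row hyperbolic (with respect to $\Phi$) Householder
transformations $P_1, P_2, \ldots, P_j$ such that

$$ P_j \cdots P_2 P_1 \begin{bmatrix} V \\ I_k \end{bmatrix} = \begin{bmatrix} 0_j \\ V_j \\ D_j \end{bmatrix}, $$

where $0_j$ denotes the $j \times k$ matrix of all zeros, $V_j \in \mathbb{R}^{(n-j) \times k}$ and $D_j \in \mathbb{R}^{k \times k}$. Let $v_j^T$ be the
first row of $V_j$ and let

$$ \hat\Phi = \begin{bmatrix} 1 & 0 \\ 0 & -I_k \end{bmatrix}. $$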

We  wish  to  use  Algorithm  ROWHHT  to  find  pj  =  [fry  </j]  so 


satisfies 


PJ+1  =  «  -  j -Pjpf, 


A-+.  A 
Note first that $D_j$ is nonsingular. The condition (13) for $P_{j+1}$ becomes

$$ \Delta_j^2 - q_j^T q_j > 0, \tag{37} $$

where from (12) $q_j$ is given by

$$ q_j = \Delta_j D_j^{-T} v_j. \tag{38} $$

Substituting (38) into (37) we obtain

$$ \Delta_j^2 \left( 1 - v_j^T D_j^{-1} D_j^{-T} v_j \right) > 0. \tag{39} $$

Note however that from

$$ D_j^T D_j - v_j v_j^T > 0 \tag{40} $$

(which is satisfied because $D_j^T D_j - V_j^T V_j > 0$) it follows that

$$ 1 - v_j^T D_j^{-1} D_j^{-T} v_j > 0, $$

which shows that (13) is satisfied.

Now,  the  construction  of  H  proceeds  in  a  straightforward  manner,  exactly  as  in  the 
(orthogonal)  updating  case. 

The scheme for downdating $R^{-T}$ can be extended to downdating the least squares solution
$w$ to $\tilde w$. The following theorem shows how this can be done.

Theorem 2 Let $H$ satisfy (31), that is

$$ H \begin{bmatrix} V & R^{-T} \\ I_k & 0 \end{bmatrix} = \begin{bmatrix} 0 & \tilde R^{-T} \\ D & F^T \end{bmatrix}. $$

If $w$ is the solution to (11), then the solution to (27) is given by

$$ \tilde w = w + F D^{-T} (d - Z^T w). $$

Moreover $F = -R^{-1} V D^{-1}$.

Proof: The proof is analogous to that of Theorem 1 and hence is omitted.

□ 
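Since the proof is omitted, the following worked derivation sketches one way to see why the formula of Theorem 2 holds. It uses the Sherman-Morrison-Woodbury identity, which is a standard route and not necessarily the argument used for Theorem 1. It assumes that the rows $Z^T$ and the entries $d$ are removed from the data $X$ and $s$, that $R^T R = X^T X$, that $V = R^{-T} Z$, and that $D^T D = I_k - V^T V$ as in the proof of Lemma 3.

```latex
\begin{aligned}
\tilde w &= (R^T R - Z Z^T)^{-1} (X^T s - Z d), \qquad X^T s = R^T R \, w, \\
(R^T R - Z Z^T)^{-1} &= R^{-1} R^{-T} + R^{-1} V \, (I_k - V^T V)^{-1} V^T R^{-T}, \\
\tilde w &= w - R^{-1} V \, (I_k - V^T V)^{-1} (d - Z^T w) \\
         &= w - R^{-1} V D^{-1} D^{-T} (d - Z^T w) \\
         &= w + F D^{-T} (d - Z^T w), \qquad F = -R^{-1} V D^{-1}.
\end{aligned}
```

The third line follows by expanding the product of the first two and collecting the terms in $d$ and $Z^T w$.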


Thus, summarizing the results of this section, we obtain the following algorithm.

Algorithm IDOWN-k

Given: $R^{-T}$ and $w$, where $X = QR$ and $w$ solves (??).
Input: Set of $k$ observations $[\, Z^T \;\; d \,]$.

Then this algorithm computes $\tilde R^{-T}$ and $\tilde w$, where

$$ X = \begin{bmatrix} \tilde X \\ Z^T \end{bmatrix} = QR, $$

$\tilde X = \tilde Q \tilde R$ and $\tilde w$ solves (27).

1. Compute $V = -R^{-T} Z$.

Cost $kn^2/2$ multiplications.

2. Find $H = P_n \cdots P_2 P_1$, where $P_i$ are row hyperbolic Householder
transformations, such that

$$ H \begin{bmatrix} V \\ I_k \end{bmatrix} = \begin{bmatrix} 0 \\ D \end{bmatrix}. $$

Cost $13 \cdot k^2 n$ multiplications.

3. Downdate $R^{-T}$ to $\tilde R^{-T}$:

$$ H \begin{bmatrix} R^{-T} \\ 0 \end{bmatrix} = \begin{bmatrix} \tilde R^{-T} \\ F^T \end{bmatrix}. $$

Cost $kn^2$ multiplications.

4. Downdate $w$ to $\tilde w$:

$$ \tilde w = w - F D^{-T} (d - Z^T w). $$

Cost | $k^2 + 2kn$ multiplications (as from (10) the QR decomposition
of $D$ is already available from step 2).

It is easy to see that the complexity analysis for the above algorithm is the same as for
Algorithm IUP-k. That is, the total cost is $\tfrac{3}{2} kn^2 + 13 \cdot k^2 n + 2 kn$ + | $k^2$ multiplications.
Moreover, the straightforward implementation of the rank-1 downdating method of Pan and
Plemmons [14] requires | $kn^2 + O(kn)$ multiplications.

4 Block WY Representation for Products

We  are  interested  in  row  Householder  methods  that  are  rich  in  matrix-matrix  operations  in 
order  to  increase  the  efficiency  of  our  algorithms  on  vector  and  parallel  machines.  To  that 
end,  it  is  important  to  accumulate  and  apply  products  of  Householder  transformations  in 
block  form  [12]. 

It is known (see e.g., Schreiber and Van Loan [18]) that products

$$ Q = H_n H_{n-1} \cdots H_1 $$

of column oriented Householder transformation matrices

$$ H_i = I - w_i w_i^T, \qquad i = 1, \ldots, n, \tag{41} $$

defined by $m$-vectors $w_i$ with $w_i^T w_i = 2$, can be accumulated in a compact WY form

$$ Q = I - Y T Y^T, \tag{42} $$

where $Y$ is an $m \times n$ rectangular matrix, each of whose columns is a Householder vector
$w_i$, and $T$ is a unit lower triangular $n \times n$ matrix. Obviously, then, if $A$ is an $m \times n$ matrix
then $H_n \cdots H_1 A$ can be accumulated using matrix-matrix operations as

$$ H_n H_{n-1} \cdots H_1 A = QA = A - Y T (Y^T A). $$

An  algorithm  for  constructing  and  applying  Q  in  the  form  (42)  is  in  the  new  LAPACK 
software  system  [1].  We  remark  that  Puglisi  [16]  has  extended  the  work  in  [18]  by  giving  a 
scheme  to  compute  and  apply  the  product  form  (42)  which  involves  more  BLAS-3  matrix- 
matrix  operations,  but  which  also  requires  additional  work  and  storage. 

Clearly,  since  orthogonal  row  Householder  transformation  matrices  P  as  given  in  (1) 
can  also  be  written  in  the  form  (41),  the  same  results  on  accumulation  and  application  of 
products  of  Householder  transformations  in  block  form  apply  for  our  case.  Thus  the  use  of 
row  Householder  orthogonal  transformations  for  modifying  the  inverse  QR  factorization  is 
rich  in  level-3  BLAS  operations,  and  the  compact  WY  representation  block  algorithms  in 
LAPACK  can  be  used  for  our  application. 

The case of row hyperbolic Householder transformations, used for downdating, requires
some further discussion. Recall that for an $m$-vector $p_i$ and a signature matrix $\Phi$, an $m \times m$
row (or column) hyperbolic Householder transformation matrix can be written in the form

$$ P_i = \Phi - \frac{2}{p_i^T \Phi p_i} \, p_i p_i^T, \tag{43} $$

provided that $0 < p_i^T \Phi p_i$. The matrix $P_i$ is pseudo orthogonal with respect to $\Phi$, i.e.,
$P_i^T \Phi P_i = \Phi$. Observe also that $P_i = P_i^T$. We now proceed to show how to accumulate and
apply products of hyperbolic Householder transformations in a compact WY-type
block representation similar to (42).

First, we write (43) in the form

$$ P_i = \Phi - w_i w_i^T, \tag{44} $$

where $w_i$ is an $m$-vector given by

$$ w_i = \left( \frac{2}{p_i^T \Phi p_i} \right)^{1/2} p_i. $$

Note that $w_i^T \Phi w_i = 2$.

It will be shown that products

$$ Q_\Phi = P_n P_{n-1} \cdots P_1 $$

of row or column oriented hyperbolic Householder transformation matrices (44), defined by
$m$-vectors $w_i$ and associated with the same signature matrix $\Phi$, can be accumulated in a
compact WY form

$$ Q_\Phi = \Phi^n - \Phi^{n-1} Y T Y^T. \tag{45} $$

A method for computing the block representation (45) is given by the following theorem.


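Theorem 3 Suppose $Q_\Phi = \Phi^i - \Phi^{i-1} Y T Y^T$ is an $m \times m$ matrix, pseudo orthogonal with
respect to $\Phi$, with $Y$ $m \times i$ and with $T$ a unit lower triangular $i \times i$ matrix. If $P = \Phi - w w^T$,
with $w$ an $m$-vector such that $0 < w^T \Phi w$, and $z^T = -w^T \Phi^{i-1} Y T$, then the product $P Q_\Phi$ is
given by

$$ P Q_\Phi = \Phi^{i+1} - \Phi^i Y_+ T_+ Y_+^T, \tag{46} $$

where

$$ Y_+ = [\, Y \;\; \Phi^i w \,], \qquad T_+ = \begin{bmatrix} T & 0 \\ z^T & 1 \end{bmatrix}. \tag{47} $$

Proof: It can be seen that

$$ \begin{aligned}
P Q_\Phi &= (\Phi - w w^T)(\Phi^i - \Phi^{i-1} Y T Y^T) \\
 &= \Phi^{i+1} - \Phi^i Y T Y^T + w w^T \Phi^{i-1} Y T Y^T - w w^T \Phi^i \\
 &= \Phi^{i+1} - \Phi^i Y T Y^T - w z^T Y^T - w w^T \Phi^i \\
 &= \Phi^{i+1} - \Phi^i \, [\, Y \;\; \Phi^i w \,] \begin{bmatrix} T & 0 \\ z^T & 1 \end{bmatrix} \begin{bmatrix} Y^T \\ w^T \Phi^i \end{bmatrix} \\
 &= \Phi^{i+1} - \Phi^i Y_+ T_+ Y_+^T.
\end{aligned} $$

□

Notice that $Q_\Phi = P_n P_{n-1} \cdots P_1$ reduces to $\Phi - Y T Y^T$ if $n$ is odd, and to $I - \Phi Y T Y^T$ if
$n$ is even.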
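The recurrence of Theorem 3 can be checked numerically on a small case. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the dense list-of-lists storage, and the sample vectors are illustrative assumptions. It builds $Y$ and $T$ factor by factor and compares the compact form with the explicitly multiplied product.

```python
# Sketch: accumulate P_n ... P_1, with P_i = Phi - w_i w_i^T, in the compact
# form Phi^n - Phi^(n-1) Y T Y^T (Theorem 3). All names are hypothetical.

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def phi_power(phi, e):
    # phi holds the +1/-1 diagonal of Phi, so Phi^e is Phi (e odd) or I (e even).
    return list(phi) if e % 2 else [1.0] * len(phi)

def scaled_vector(phi, p):
    # w = sqrt(2 / (p^T Phi p)) p, which requires p^T Phi p > 0.
    s = sum(f * x * x for f, x in zip(phi, p))
    if s <= 0:
        raise ValueError("p^T Phi p must be positive")
    return [(2.0 / s) ** 0.5 * x for x in p]

def accumulate(phi, ws):
    """Return (Y, T): Y as a list of columns, T as unit lower triangular rows."""
    Y, T = [], []
    for i, w in enumerate(ws):          # i factors have been accumulated so far
        if i:
            u = [f * x for f, x in zip(phi_power(phi, i - 1), w)]
            a = [sum(ur * yr for ur, yr in zip(u, col)) for col in Y]  # w^T Phi^(i-1) Y
            z = [-sum(a[s] * T[s][c] for s in range(c, i)) for c in range(i)]
        else:
            z = []
        Y.append([f * x for f, x in zip(phi_power(phi, i), w)])       # new column Phi^i w
        T.append(z + [1.0])                                           # new row [z^T 1]
    n = len(ws)
    return Y, [row + [0.0] * (n - len(row)) for row in T]

def compact_product(phi, Y, T):
    m, n = len(phi), len(Y)
    Ymat = [[Y[j][r] for j in range(n)] for r in range(m)]  # m x n
    YTYt = matmul(matmul(Ymat, T), Y)                       # Y (as rows) is Y^T
    lead, inner = phi_power(phi, n), phi_power(phi, n - 1)
    return [[(lead[r] if r == c else 0.0) - inner[r] * YTYt[r][c]
             for c in range(m)] for r in range(m)]

def explicit_product(phi, ws):
    m = len(phi)
    Q = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
    for w in ws:
        P = [[(phi[r] if r == c else 0.0) - w[r] * w[c] for c in range(m)]
             for r in range(m)]
        Q = matmul(P, Q)
    return Q
```

Only the last two products in `compact_product` touch all of $Y$, which is where the matrix-matrix character of the representation comes from; the explicit product is included purely as a reference for comparison.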

The scheme described in Theorem 3 for accumulating products of hyperbolic Householder
transformation matrices has the same advantages as the storage-efficient compact WY
representation scheme for the orthogonal case given in [18]. In summary, the row orthogonal
and row hyperbolic Householder methods considered in this paper are rich in matrix-matrix
operations, and this fact can be used to increase the efficiency of our algorithms on vector
and parallel machines.

5 Numerical Experiments

In  this  section  we  provide  numerical  experiments  which  consist  of  sliding  window  recursive 
least  squares  problems  (RLS),  and  are  designed  to  compare  the  accuracy  of  our  block  method 
with  k  applications  of  the  rank-1  covariance  inverse  factorization  RLS  method  of  Pan  and 
Plemmons  [14].  In  each  of  the  examples  given  below,  we  indicate  the  length  of  the  window 
used,  and  the  number  of  observations  which  will  be  added  and  deleted. 

The set of examples we use here has been used to test the effectiveness of condition
estimators [9, 10, 15], and has also been used by Björck, Park and Eldén [5] to illustrate
how the corrected semi-normal equations can be used to stabilize rank-1 downdating. These
examples are described as follows.

Example 1: In this example we construct a 100 x 10 data matrix whose entries are generated
randomly from a uniform distribution in (-50, 50). We then scale the first column of this
matrix by multiplying its entries by $10^{-3}$. This causes the windowed
data to have a condition number on the order of $10^3$. Here we choose the window length to
be 20, and the number of observations added and deleted is k = 5.

Example 2: In this example we construct a 50 x 5 data matrix from a uniform distribution
in (0,1). In this case, though, we add an outlier of the form $r \times 10^3$, where r is again a random
number in (0,1), to the (18,3) entry. This outlier causes the data to become
ill-conditioned when the 18th row is added to the system. Here we choose the window length
to be 8, and the number of observations added and deleted is k = 3.

Example 3: In this example we construct a 50 x 5 matrix. The first 25 rows are the first 25
rows of the Hilbert matrix. The second 25 rows are simply the first 25 rows given in reverse
order. We then add a random number, $\delta$, to all the entries in order to control the degree of
ill-conditioning of the data. The smaller the value of $\delta$, the more ill-conditioned the data.
As is done in [5], we use $\delta = 10^{-5}$ and $\delta = 10^{-9}$. Here, we again take the window length to
be 8, and k = 3.

The numerical tests for the above examples were performed using Matlab, and the right
hand side vector was chosen to be the row sums of the data matrix. Thus the exact solution
is known, and is the vector of all ones. The quantities reported are the relative errors and
residuals for our block method, and the rank-1 rotation based method of Pan and Plemmons
[14]. The results are summarized in Figures 1-8, where the solid line is the plot of the rank-1
method and the dashed line is a plot of our block method. Also shown in the figures is a
plot of 1/cond(X) for each window, indicated by + signs.
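The test setup of Example 1 can be sketched as follows. This is an illustrative Python reconstruction, not the authors' Matlab code: the function names, the seed, and the assumption that each step advances the window by exactly k rows (dropping the k oldest and adding k new ones) are not stated in the text.

```python
import random

def example1_data(rows=100, cols=10, scale=1e-3, seed=0):
    """Uniform(-50, 50) entries with the first column scaled by `scale`.
    The right-hand side is the vector of row sums, so the exact least
    squares solution of every window is the vector of all ones."""
    rng = random.Random(seed)                 # seed is an arbitrary choice
    X = [[rng.uniform(-50.0, 50.0) for _ in range(cols)] for _ in range(rows)]
    for row in X:
        row[0] *= scale
    s = [sum(row) for row in X]
    return X, s

def sliding_windows(rows, window, k):
    """Yield (start, stop) row ranges; each step deletes the k oldest rows
    of the window and appends the next k rows (assumed stepping rule)."""
    start = 0
    while start + window <= rows:
        yield start, start + window
        start += k

X, s = example1_data()
windows = list(sliding_windows(len(X), window=20, k=5))
```

Each pair `(start, stop)` identifies the rows `X[start:stop]` and `s[start:stop]` of one window; a block method would downdate by the k rows leaving the window and update by the k rows entering it.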

We see from the figures that numerically our block method performs in a similar manner
to k applications of the rank-1 method of Pan and Plemmons. But since our methods are
rich in BLAS-3 computations, our block method is better suited for vector and parallel
architectures.

We note that, as for the rank-1 method of Pan and Plemmons, our block method can give
inaccurate results if the data becomes too ill-conditioned. This is to be expected, though,
since the downdating is sensitive to ill-conditioning; see [20]. To obtain a more reliable
block method when the data is ill-conditioned, one can apply schemes which also modify
the Q factor, such as a generalization of the method proposed in [8] for updating
the Gram-Schmidt QR factorization to the block case. Another approach is to use
the original data, X. This could be done by extending the work of Björck, Park and Eldén
[5], which uses the corrected semi-normal equations for rank-1 modifications, to the rank
k case. These two approaches are the subject of ongoing research and will
be reported elsewhere.

Perhaps a more straightforward approach is to use a condition estimation technique,
such as ACE [15], and, if the problem becomes ill-conditioned, re-initialize by computing a
new inverse orthogonal factorization, producing a new $R^{-1}$. That is, ACE could be used to
monitor the conditioning of the data, which can be done in $O(n) + O(k^3)$ operations per
time step. The $O(k^3)$ comes from solving an eigenvalue problem required in ACE. If the data
becomes ill-conditioned, one would then compute an explicit QR factorization of the current
data, to re-initialize the RLS process, and continue with the updating and downdating. This
approach would be most useful for problems such as Example 2, where the data is well
conditioned except for a small number of windows, made ill-conditioned by outliers. Of
course, if the problem is well conditioned, then our scheme is very efficient and needs no
stabilizing modifications.

Abstract

In least squares problems, it is often desired to solve the same problem repeatedly
but with several rows of the data either added, deleted, or both. Methods for
adding or deleting one row of data at a time are known. In this paper we introduce
fundamental rank-k updating and downdating methods and show how extensions of
rank-1 modifications for LINPACK, Corrected Semi-Normal Equations (CSNE), and
Gram-Schmidt factorizations can all be derived from these fundamental results. We
then analyze the cost of each new algorithm, and make comparisons to k applications
of the corresponding rank-1 algorithms. We provide experimental results comparing
the numerical accuracy of the various algorithms, paying particular attention to the
downdating methods, due to their potential numerical difficulties for ill-conditioned
problems.


Abbreviated  Title. 

Key  Words. 

AMS(MOS)  Subject  Classifications. 

1  Introduction 

A problem which frequently arises in signal processing is the linear least squares problem:

$$ \min_{x \in \mathbb{R}^n} \| Ax - b \|_2 \tag{1} $$

where $A \in \mathbb{R}^{m \times n}$, $A$ is rank $n$, $b \in \mathbb{R}^m$, and $m > n$. This problem may be solved through a
QR factorization of the augmented $m \times (n + 1)$ matrix $(A \;\; b)$,

$$ (A \;\; b) = QR = Q \begin{pmatrix} U & u \\ 0 & \rho \end{pmatrix} \tag{2} $$

where $Q \in \mathbb{R}^{m \times (n+1)}$ has orthonormal columns, $U \in \mathbb{R}^{n \times n}$ is upper triangular, $u \in
\mathbb{R}^{n \times 1}$, and $\rho$ is a scalar. We then insert the factorization into the problem:

$$ \min_{x \in \mathbb{R}^n} \| Ax - b \|_2^2
 = \min_{x \in \mathbb{R}^n} \| Q^T A x - Q^T b \|_2^2
 = \min_{x \in \mathbb{R}^n} \left\| \begin{pmatrix} U \\ 0 \end{pmatrix} x - \begin{pmatrix} u \\ \rho \end{pmatrix} \right\|_2^2
 = \min_{x \in \mathbb{R}^n} \| Ux - u \|_2^2 + \rho^2. $$

The solution vector $x$ is then found by solving

$$ Ux = u \tag{3} $$

by the method of back substitution. The two-norm of the residual of the problem is $\rho$. Note
that once the factorization is found, only $U$, $u$, and $\rho$ are needed to solve the problem.
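The steps (1) to (3) can be sketched in plain Python. This is a minimal illustration under stated assumptions: modified Gram-Schmidt is used for the factorization only because it is short to write down, the function names are hypothetical, and no attention is paid to efficiency or to ill-conditioned data.

```python
def qr_mgs(M):
    """Thin QR of an m x p matrix (list of rows) by modified Gram-Schmidt.
    Returns Q as a list of p columns and R as a p x p upper triangular matrix."""
    m, p = len(M), len(M[0])
    cols = [[M[i][j] for i in range(m)] for j in range(p)]
    Q, R = [], [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = cols[j][:]
        for i in range(j):
            R[i][j] = sum(Q[i][t] * v[t] for t in range(m))
            v = [v[t] - R[i][j] * Q[i][t] for t in range(m)]
        R[j][j] = sum(x * x for x in v) ** 0.5
        # A zero norm (b exactly in the range of A) leaves a zero column.
        Q.append([x / R[j][j] if R[j][j] else 0.0 for x in v])
    return Q, R

def back_substitute(U, u):
    """Solve U x = u for upper triangular U."""
    n = len(u)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (u[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

def lstsq_via_qr(A, b):
    """Factor (A b) as in (2), then solve U x = u as in (3).
    Returns the solution x and the residual norm rho."""
    n = len(A[0])
    Ab = [list(row) + [bi] for row, bi in zip(A, b)]
    _, R = qr_mgs(Ab)
    U = [R[i][:n] for i in range(n)]
    u = [R[i][n] for i in range(n)]
    rho = R[n][n]
    return back_substitute(U, u), rho
```

Note that the orthogonal factor is discarded once $U$, $u$, and $\rho$ have been extracted, which mirrors the remark that only these three quantities are needed to solve the problem.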

Frequently, one has already found the QR factorization in (2), and wishes to solve (1) with
one or more rows added to or deleted from the data $(A \;\; b)$. This is known as the recursive least
squares problem. Computing the QR factorization of a matrix is computationally expensive.
Since one already has Q and R (or just R) from a problem that is close to the one we wish
to solve, we would like to save on computation by just finding the new factorization (say
$Q_{new}$ and $R_{new}$) from the old factorization and the data to be added or deleted. The new
solution is then computed by using $U_{new}$ and $u_{new}$ as above in (3).

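For example, say one has $k$ new rows, $(Y \;\; c) \in \mathbb{R}^{k \times (n+1)}$, and it is desired to append them
to the end of the data $(A \;\; b)$. Then the problem becomes: given (2), solve

$$ \min_{x \in \mathbb{R}^n} \left\| \begin{pmatrix} A \\ Y \end{pmatrix} x - \begin{pmatrix} b \\ c \end{pmatrix} \right\|_2. $$

This is called a rank-k update of a linear least squares problem. To accomplish this, first
make the following construction:

$$ \begin{pmatrix} A & b \\ Y & c \end{pmatrix} = \begin{pmatrix} Q & 0 \\ 0 & I_k \end{pmatrix} \begin{pmatrix} U & u \\ 0 & \rho \\ Y & c \end{pmatrix}. $$

Since Q has orthonormal columns, all that needs to be done at this point is to apply an
orthogonal transformation, say $H$, to the factorization, designed to reduce R to upper
triangular form (i.e., zero out $(Y \;\; c)$) while preserving the orthogonal property of the columns
of Q.

This leads to the following:

$$ \begin{pmatrix} A & b \\ Y & c \end{pmatrix}
 = \begin{pmatrix} Q & 0 \\ 0 & I_k \end{pmatrix} H H^T \begin{pmatrix} U & u \\ 0 & \rho \\ Y & c \end{pmatrix}
 = \begin{pmatrix} \tilde Q_{11} & \tilde Q_{12} \\ \tilde Q_{21} & \tilde Q_{22} \end{pmatrix} \begin{pmatrix} \tilde U & \tilde u \\ 0 & \tilde\rho \\ 0 & 0 \end{pmatrix}
 = \tilde Q \tilde R. \tag{4} $$

The new solution can be obtained from $\tilde U x = \tilde u$. The two-norm of the residual of the new
problem is $\tilde\rho$. The problem of determining $\tilde Q$ and $\tilde R$ in (4) is known as updating the QR
factorization.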

The other possibility is to remove data from the problem. Assume that we desire to
remove the first $k$ rows, $(Z \;\; d)$, of the data $(A \;\; b)$. This would be a rank-k downdate of the
linear least squares problem. Then given (2), the problem to be solved is

$$ \min_{x \in \mathbb{R}^n} \| \tilde A x - \tilde b \|_2 \tag{5} $$

where

$$ (A \;\; b) = \begin{pmatrix} Z & d \\ \tilde A & \tilde b \end{pmatrix} = \begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix} \begin{pmatrix} U & u \\ 0 & \rho \end{pmatrix}. \tag{6} $$

Here $Z$ is $k \times n$, $d$ is $k \times 1$, $Q_{11}$ is $k \times (n + 1)$, $Q_{21}$ is $(m - k) \times (n + 1)$, $\tilde A$ is $(m - k) \times n$,
$\tilde b$ is $(m - k) \times 1$, and of course Q has orthonormal columns. The problem of finding the
new factorization $(\tilde A \;\; \tilde b) = \tilde Q \tilde R$ is known as downdating the QR factorization. Recall that the
updated triangular factor was needed to solve the updated linear least-squares problem;
similarly, $\tilde R$ is needed to solve the downdated linear least-squares problem. Depending on
the algorithm used, the matrix $\tilde Q$ may or may not need to be stored and downdated.

Downdating the QR factorization is the reverse of the updating process (4). That is, the
downdating process begins with the augmented factorization (7).


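$$ (A \;\; b) = \begin{pmatrix} Z & d \\ \tilde A & \tilde b \end{pmatrix} = \hat Q \hat R
 = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} \begin{pmatrix} U & u \\ 0 & \rho \\ 0 & 0 \end{pmatrix}. \tag{7} $$

Here, as in (4), $\hat Q$ is an orthonormal column matrix containing Q augmented with $k$ new
orthonormal columns. Then an orthogonal transformation $H$ (similar to that used in updating)
is applied to obtain $\tilde Q$ and $\tilde R$.

$$ \begin{pmatrix} Z & d \\ \tilde A & \tilde b \end{pmatrix}
 = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} H^T H \begin{pmatrix} U & u \\ 0 & \rho \\ 0 & 0 \end{pmatrix}
 = \begin{pmatrix} 0 & I_k \\ \tilde Q & 0 \end{pmatrix} \begin{pmatrix} \tilde U & \tilde u \\ 0 & \tilde\rho \\ Z & d \end{pmatrix}. \tag{8} $$

Notice that the final terms in (8) have permuted rows (compared to the corresponding terms
in (4)) since rows are being deleted from the top as opposed to being added at the bottom.
Still, the two-step procedure defined by (7) and (8) is the logical reverse of the updating
process.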

In downdating, the $k$ new orthogonal columns are not a by-product of applying $H$ as in
(4), but must be found before applying $H$. There is some difficulty in determining $Q_{12}$ and
$Q_{22}$, as they are not unique (observe that an orthogonal transformation may be inserted in
the middle of the factorization (7) that alters $Q_{12}$ and $Q_{22}$ but not the other terms). To
assist in determining $Q_{12}$ and $Q_{22}$, we examine other requirements on $\hat Q$.

From equation (8), $\hat Q$ and $H$ must satisfy

$$ \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} = \begin{pmatrix} 0 & I_k \\ \tilde Q & 0 \end{pmatrix} H. $$

Multiplication by $(Q_{11} \;\; Q_{12})^T = H^T (0 \;\; I_k)^T$ on both sides of this equation yields

$$ \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} \begin{pmatrix} Q_{11}^T \\ Q_{12}^T \end{pmatrix} = \begin{pmatrix} I_k \\ 0 \end{pmatrix}. \tag{9} $$

Therefore the first $k$ rows of $\hat Q$ must be orthogonal. This orthogonality condition may be
combined with expression (7) as shown in (10).


The last $k$ columns of $\hat Q$ are still not uniquely determined in (10). However, choosing
$\hat Q_{12}^T$ to be an upper-triangular matrix, denoted simply as $Q_{12}^T$, and making the corresponding
choice $Q_{22}$ for $\hat Q_{22}$, makes (10) an expression of the downdating problem in terms of the QR
factorization of an enhanced matrix.

$$ \begin{pmatrix} Z & d & I_k \\ \tilde A & \tilde b & 0 \end{pmatrix}
 = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} \begin{pmatrix} R & Q_{11}^T \\ 0 & Q_{12}^T \end{pmatrix} \tag{11} $$

This relation is fundamental to all rank-k downdating methods described in this paper. Since
it compactly represents all of the conditions necessary to determine $\hat Q$, (11) can be used to
determine $Q_{12}$ and $Q_{22}$. The relation (11) is an extension to rank-k of an analogous relation
derived in [DGKS76].

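Once the factorization (11) is obtained, we then proceed as in (8), constructing some
orthogonal transformation $H$ which when applied will produce the downdated factorization.
Here $H$ operates on the columns of $\hat Q$ to transform $(Q_{11} \;\; Q_{12})$ to $(0 \;\; P)$, where $P$ is an
orthogonal matrix. $H$ can be constructed such that $H^T$, when applied to $\hat R$, will change
the element values in $R$ but still preserve its upper triangular property. This produces the
following result:

$$ \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} H H^T \begin{pmatrix} R & Q_{11}^T \\ 0 & Q_{12}^T \end{pmatrix}
 = \begin{pmatrix} 0 & P \\ \tilde Q & 0 \end{pmatrix} \begin{pmatrix} \tilde U & \tilde u & 0 \\ 0 & \tilde\rho & 0 \\ Z & d & P^T \end{pmatrix}. \tag{12} $$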


Because $\hat Q$ has orthonormal columns, this transformation will also implicitly zero out $Q_{22}$.
The equation $\tilde U x = \tilde u$ is solved to find the solution $x$ to the downdated problem. The
two-norm of the residual corresponding to the downdated problem is $\tilde\rho$.

This paper is devoted to the discussion of rank-k downdating methods derived from (11).
We emphasize downdating and not updating because downdating is numerically harder. The
k new columns found in the downdating problem have several special properties: orthogonality
in columns, orthogonality in rows, and triangularity. This allows for considerable
variation and discussion in possible implementations. In all of these implementations, however,
the most important quantities are the k orthogonal rows, $(Q_{11} \;\; Q_{12})$. These are needed
for finding the triangular factor $\tilde R$ of $(\tilde A \;\; \tilde b)$. If the matrix $\tilde Q$ is to be determined as well,
then $Q_{22}$ is also necessary. In Section 2 we discuss methods that maintain Q and R. Section
3 presents methods that only maintain R. In both cases, the methods for obtaining and
applying $H$ are the same, and Section 4 examines possible implementations for this part of
the downdating process. Section 5 contains an analysis of the computation involved in all of
the methods presented.

Since rank-one updating and downdating methods are known, one might achieve a rank-k
modification to the data by k applications of these rank-one methods. However, rank-k
methods make use of matrix-vector and matrix-matrix operations, as opposed to vector-vector
and matrix-vector operations in rank-one algorithms. This may make rank-k methods
faster on processors with caches and parallel computers than the corresponding repetitive
applications of rank-one methods. Section 6 of this paper presents experimentation on the
numerical properties of our algorithms.

2 Rank-k Downdating of the Gram-Schmidt Factorization

In this section we discuss downdating the recursive least squares problem where the
Gram-Schmidt factorization is maintained, i.e., modifying both Q and R as described in Section 1.
Specifically we discuss several methods by which to obtain the k new columns in the
orthogonal factor. In Section 4 we discuss methods by which elements in the constructed
factorization are zeroed out to produce the desired downdated Q and R.

2.1  Classical  Gram-Schmidt  on  augmented  problem 

The first method is to use classical Gram-Schmidt with reorthogonalization (CGS) to build
on the orthonormal columns already given. Equation (11) represents a QR factorization,
which could have been accomplished by classical Gram-Schmidt. Since we have $Q_{11}$, $Q_{21}$,
and $R$ already, we have completed $n + 1$ iterations (recall that the ith iteration of classical
Gram-Schmidt produces the ith column of Q and the ith column of R). We may then proceed
with the remaining k iterations of the orthogonalization process to get the new orthogonal
columns.
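The remaining k iterations can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the existing orthonormal columns are stored as a list of column vectors, the appended columns of (11) are taken to be the first k unit vectors, "reorthogonalization" is interpreted as applying classical Gram-Schmidt twice, and all function names are hypothetical. It assumes that no unit vector lies in the span of the current columns.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cgs2_append(Q, v):
    """Orthogonalize v against the orthonormal columns in Q using classical
    Gram-Schmidt applied twice. Returns (coefficients, norm, new unit column)."""
    coeffs = [0.0] * len(Q)
    v = list(v)
    for _ in range(2):
        c = [dot(q, v) for q in Q]     # all projections use the same v (classical GS)
        v = [vi - sum(cj * q[i] for cj, q in zip(c, Q)) for i, vi in enumerate(v)]
        coeffs = [a + b for a, b in zip(coeffs, c)]
    norm = dot(v, v) ** 0.5
    return coeffs, norm, [vi / norm for vi in v]

def orthonormalize(cols):
    """Build orthonormal columns from the given columns (stands in for the
    n + 1 iterations that are assumed to be already completed)."""
    Q = []
    for v in cols:
        Q.append(cgs2_append(Q, v)[2])
    return Q

def extend_with_unit_vectors(Q, k):
    """Continue Gram-Schmidt on the first k unit vectors, as in (11).
    Returns the k new orthonormal columns (the blocks Q12 over Q22)."""
    m = len(Q[0])
    work = [q[:] for q in Q]
    for j in range(k):
        e = [1.0 if i == j else 0.0 for i in range(m)]
        work.append(cgs2_append(work, e)[2])
    return work[len(Q):]
```

The coefficients returned by `cgs2_append` for the unit vectors are the entries of $Q_{11}^T$ and $Q_{12}^T$ in the right-hand factor of (11); they are discarded here because only the new columns are needed for the illustration.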
2.2 Modified Gram-Schmidt on augmented problem

Modified Gram-Schmidt (MGS) can also be used to get the new orthonormal columns. Recall
that the ith iteration of MGS produces the ith column of Q and the ith row of R, and
updates the columns of Q to be formed in later iterations. Again we use the fact that (11)
represents a QR factorization that is partially completed. After $n + 1$ iterations of MGS, we
have the following:

$$ \begin{pmatrix} (Z \;\; d) & I_k \\ (\tilde A \;\; \tilde b) & 0 \end{pmatrix}
 = \begin{pmatrix} Q_{11} & T_1 \\ Q_{21} & T_2 \end{pmatrix} \begin{pmatrix} R & Q_{11}^T \\ 0 & I_k \end{pmatrix} $$

The only part of the above that we do not have is $T = (T_1^T \;\; T_2^T)^T$. We can find $T$ by
performing only the update portion of MGS (i.e., $T = T - q_i q_i^T T$) for each of the $n + 1$
orthogonal columns from $(Q_{11}^T \;\; Q_{21}^T)^T$ one at a time in ascending order on the new k columns
$(T_1^T \;\; T_2^T)^T$. Then we can proceed with the remaining k iterations of MGS.


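2.3 Small QR factorization

Note that from (11)

$$ \begin{pmatrix} I_k \\ 0 \end{pmatrix} = \begin{pmatrix} Q_{11} Q_{11}^T + Q_{12} Q_{12}^T \\ Q_{21} Q_{11}^T + Q_{22} Q_{12}^T \end{pmatrix} $$

which can be rearranged into the following:

$$ \begin{pmatrix} Q_{12} \\ Q_{22} \end{pmatrix} Q_{12}^T = \begin{pmatrix} I_k - Q_{11} Q_{11}^T \\ -Q_{21} Q_{11}^T \end{pmatrix} \tag{13} $$

Since $Q_{12}^T$ is upper triangular, this is a QR factorization and $Q_{12}$ and $Q_{22}$ can be obtained
by any QR factorization algorithm, in particular by either using MGS or CGS.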


2.4 Separation of equations

We can rewrite (13) as follows:

$$ \begin{pmatrix} Q_{12} Q_{12}^T \\ Q_{22} Q_{12}^T \end{pmatrix} = \begin{pmatrix} I_k - Q_{11} Q_{11}^T \\ -Q_{21} Q_{11}^T \end{pmatrix} \tag{14} $$

Instead of computing a QR factorization on the whole factor, we can solve for $Q_{12}$ and
$Q_{22}$ in separate steps. The top equation represents a Cholesky factorization that can be
used to solve for $Q_{12}$. Then $Q_{12}$ can be used in the bottom equation to get $Q_{22}$ by forward
substitution.
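The two separate steps can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, $Q_{11}$ and $Q_{21}$ are passed as lists of rows, and $I_k - Q_{11} Q_{11}^T$ is assumed to be positive definite so that the Cholesky factor exists. Because $Q_{12}^T$ is upper triangular, $Q_{12}$ itself is the lower triangular Cholesky factor.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cholesky_lower(G):
    """Lower triangular L with L L^T = G (G symmetric positive definite)."""
    k = len(G)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            L[i][j] = s ** 0.5 if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower triangular L."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][t] * x[t] for t in range(i))) / L[i][i])
    return x

def new_columns(Q11, Q21):
    """Top block of (14): Q12 Q12^T = I_k - Q11 Q11^T (Cholesky).
    Bottom block of (14): Q22 Q12^T = -Q21 Q11^T, solved row by row."""
    k = len(Q11)
    G = [[(1.0 if i == j else 0.0) - dot(Q11[i], Q11[j]) for j in range(k)]
         for i in range(k)]
    Q12 = cholesky_lower(G)
    # Each row q of Q22 satisfies Q12 q^T = (row of -Q21 Q11^T)^T.
    Q22 = [forward_solve(Q12, [-dot(row, Q11[j]) for j in range(k)]) for row in Q21]
    return Q12, Q22
```

For instance, with the single orthonormal column (0.6, 0, 0.8) and k = 1, the sketch returns the new column (0.8, 0, -0.6), which has unit norm and is orthogonal to the original one.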
2.5 QR of the residual matrix T

The left-hand side of (13) is related to a least squares problem associated with the matrix
defined in (11), namely,


Now (11) implies


and hence the solution $X$ satisfies

$$ RX = Q_{11}^T \tag{17} $$

while the residual error has the following representation:

$$ T = (A \;\; b) X - \begin{pmatrix} I_k \\ 0 \end{pmatrix} = -\begin{pmatrix} Q_{12} \\ Q_{22} \end{pmatrix} Q_{12}^T. \tag{18} $$

Note that in (18) the right-hand side is the QR factorization of $T$. This is not surprising
as $T$ is the residual in the least squares problem (16), and hence the columns of $T$ span a
subspace orthogonal to the column space of the defining matrix $(A \;\; b)$.

However it may not be easy to get a numerically accurate orthonormal basis for $T$ which
is orthogonal to the column space of $(A \;\; b)$. This will be the case if, for example, the columns
of $T$ have norms of substantially different magnitudes. The following method for factorizing
$T$ uses a step of refinement (or reorthogonalization) in order to provide improved numerical
results.

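We first insert $T$ into an augmented factorization problem to ensure that its orthogonal
columns will span a subspace orthogonal to the subspace of $(A \;\; b)$:

$$ ((A \;\; b) \;\; T) = \begin{pmatrix} Q_{11} & -Q_{12} \\ Q_{21} & -Q_{22} \end{pmatrix} \begin{pmatrix} R & R_{12} \\ 0 & R_{22} \end{pmatrix} \tag{19} $$

From this, we have an equation for $T$ from which we can get the desired factorization:

$$ T = \begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix} R_{12} - \begin{pmatrix} Q_{12} \\ Q_{22} \end{pmatrix} R_{22} \tag{20} $$

First we need to multiply both sides of (20) by $(Q_{11}^T \;\; Q_{21}^T)$ to get $R_{12}$:

$$ R_{12} = (Q_{11}^T \;\; Q_{21}^T) \, T $$

Now we "correct" $T$:

$$ T_c = T - \begin{pmatrix} Q_{11} \\ Q_{21} \end{pmatrix} R_{12} $$

We can then factor $T_c$:

$$ -\begin{pmatrix} Q_{12} \\ Q_{22} \end{pmatrix} R_{22} = T_c \tag{21} $$

Note that if we multiply both sides of (21) by $-(Q_{12}^T \;\; Q_{22}^T)$ and use (18), we have $R_{22} = Q_{12}^T$,
and we are done.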


3  Downdating  R  Without  Storing  Q 

Methods  for  rank-A:  downdating  of  R  without  maintaining  Q  have  been  proposed  in  [RS86] 
and  [BS89].  The  method  proposed  in  [RS86]  is  based  on  the  fact  that  as  long  as  RTR  —  ZTZ 
is  positive  definite  then  there  exists  a  pseudo-orthogonal  transformation  H  with  respect  to 
the  signature  matrix  <f>  =  diag(In ,  —  R)  such  that 


where  U  is  the  Cholesky  factor  of  RT R  —  ZTZ.  It  is  shown  in  [RS86]  that  the  transformation 
H  can  be  constructed  as  a  product  of  hyperbolic  Householder  transformations. 

An alternative approach for rank-k downdating of R has been proposed in [BS89]. Their
downdating of R is treated as an implicit updating of U. More precisely, they show that
there exists an orthogonal H such that


where U is the desired Cholesky factor of $R^T R - Z^T Z$. It is shown in [BS89] that H can be
constructed as a product of orthogonal Householder transformations and U can be recovered
in a row by row fashion from R, Z, and H. Note that (22) and (23) do not require any
explicit information about the orthogonal factor Q.

Recall that (8) represented downdating as implicit updating as well. Equation (23) is
actually embedded in (8) (i.e., remove the Q related factors and associated applications of
H, leaving (8)). Further, if one desired to determine Q from H, R, Z determined above,
$Q_{21}$, $Q_{22}$ and thus the downdated data A, b would not necessarily be unique.

In  this  section  we  present  generalizations  to  the  block  case  of  two  other  methods  for 
downdating  the  R  factor  without  storing  the  matrix  Q. 

The  first  method  is  based  on  the  rank-1  downdating  method  proposed  by  [Saunders] 
that  was  later  implemented  in  LINPACK.  The  second  method  is  based  on  the  method  of 
corrected  semi-normal  equations  (CSNE)  proposed  by  [Bjo87].  These  two  methods  can  be 
derived  from  (11)  and  require  partial  information  about  Q. 

The  primary  difference  between  the  methods  presented  in  Section  2  and  the  LINPACK 
and  CSNE  methods  is  that  the  latter  methods  do  not  store  any  part  of  the  matrix  Q.  Instead, 
$Q_{11}$ is recovered from the following relation


u 

0 


(24) 


which comes from the first k rows of the QR decomposition of (A b) in (11). Of the other
blocks of Q, namely $Q_{12}$, $Q_{21}$, and $Q_{22}$, only $Q_{12}$ needs to be found in order to update R.
Each method finds $Q_{12}$ in a different way. Once $Q_{11}$ and $Q_{12}$ are found, both the LINPACK
method and the CSNE method use the orthogonal transformation H defined by (12) to produce
the downdated R-factor. For the purposes of this discussion, we will concentrate only on the
ways in which the two algorithms find $Q_{11}$ and $Q_{12}$, and so "a rank-k LINPACK algorithm"
refers to an algorithm for rank-k downdating which finds $Q_{11}$ and $Q_{12}$ in an analogous way to
the rank-one LINPACK algorithm, and a "rank-k CSNE algorithm" refers to an algorithm
for rank-k downdating which finds $Q_{11}$ and $Q_{12}$ in an analogous way to the rank-one CSNE
algorithm.


3.1  The  LINPACK  Downdating  Algorithm 

The rank-one LINPACK algorithm is actually a part of the method of separation of equations
(14). Once $Q_{11}$ has been found by (24), $Q_{12}$ is found from the top k rows in (14), namely
from

\[ Q_{12} Q_{12}^T = I_k - Q_{11} Q_{11}^T. \qquad (25) \]

Notice that in the rank-one case, $Q_{12}$ is found by a single square root operation. The lower
m - k rows of (14) are not used as Q does not need to be maintained.

Note that (25) itself can be viewed as a downdating problem. Thus $Q_{12}$ can be formed
either directly as the Cholesky factor of $I_k - Q_{11} Q_{11}^T$, or indirectly by constructing an
orthogonal G as in (23) such that


see  [BS89] . 
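
The direct route through (25) can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, plain Python lists stand in for matrices, and the factor is returned in upper triangular form, which is assumed here to correspond to $Q_{12}^T$.

```python
import math

def cholesky_upper(S):
    """Return upper triangular C with C^T C = S for a small positive definite S."""
    k = len(S)
    C = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            s = S[i][j] - sum(C[p][i] * C[p][j] for p in range(i))
            if i == j:
                if s <= 0.0:
                    raise ValueError("I - Q11 Q11^T is not numerically positive definite")
                C[i][i] = math.sqrt(s)
            else:
                C[i][j] = s / C[i][i]
    return C

def q12t_from_q11(Q11):
    """Q11 is a k x p list of rows; returns upper triangular C with C^T C = I_k - Q11 Q11^T."""
    k = len(Q11)
    S = [[(1.0 if i == j else 0.0) - sum(a * b for a, b in zip(Q11[i], Q11[j]))
          for j in range(k)] for i in range(k)]
    return cholesky_upper(S)

# Rank-one case: a single square root, as noted in the text.
print(q12t_from_q11([[0.6, 0.0]]))
```

When the downdate is ill-conditioned, $I_k - Q_{11} Q_{11}^T$ is close to singular and the sketch raises an error rather than returning an unreliable factor.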

3.2  The  CSNE  Downdating  Algorithm 

The CSNE downdating algorithm presented in [BPE92] is shown in that paper to be more
stable than the LINPACK method for rank-one downdating. Thus it is desirable to extend
this method to the rank-k case. The basis of this method is in the semi-normal equations
(SNE) for a least-squares problem

\[ \min \| By - c \|. \]

If U is the triangular factor from the QR factorization of B, then the SNE for a single
least-squares problem are given by (26),

\[ U^T U y = B^T c. \qquad (26) \]
The corrected semi-normal equations (CSNE) method for solving a least-squares problem
proposed in [Bjo87] uses refinement of the solution obtained by the SNE, as shown in (27).

\[ U^T U y = B^T c, \qquad t = c - By \qquad (27) \]

\[ U^T U\, \delta y = B^T t, \qquad y_c = y + \delta y, \qquad t_c = t - B\, \delta y \]


The corrected solution $y_c$ and the corrected residual vector $t_c$ will have consistently better
accuracy than the accuracy of the solution obtained by the SNE method, and often comparable
to that given by a standard QR factorization method (see [Bjo87]).
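
The two steps in (27) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the helper names are hypothetical, matrices are plain Python lists, and U is assumed to be supplied as the upper triangular factor of B.

```python
def solve_ut(U, r):
    """Solve U^T w = r by forward substitution (U upper triangular)."""
    n = len(U)
    w = [0.0] * n
    for i in range(n):
        w[i] = (r[i] - sum(U[p][i] * w[p] for p in range(i))) / U[i][i]
    return w

def solve_u(U, w):
    """Solve U y = w by back substitution."""
    n = len(U)
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        y[i] = (w[i] - sum(U[i][p] * y[p] for p in range(i + 1, n))) / U[i][i]
    return y

def sne(U, B, v):
    """Solve the semi-normal equations U^T U y = B^T v."""
    rhs = [sum(B[i][j] * v[i] for i in range(len(B))) for j in range(len(U))]
    return solve_u(U, solve_ut(U, rhs))

def residual(B, y, v):
    """Return v - B y."""
    return [v[i] - sum(B[i][j] * y[j] for j in range(len(y))) for i in range(len(B))]

def csne(U, B, c):
    """One SNE solve followed by one correction step, as in (27)."""
    y = sne(U, B, c)
    t = residual(B, y, c)
    dy = sne(U, B, t)
    yc = [a + b for a, b in zip(y, dy)]
    tc = residual(B, dy, t)
    return yc, tc

# Small example: B has orthogonal columns, so U = diag(5, 2).
B = [[3.0, 0.0], [4.0, 0.0], [0.0, 2.0]]
U = [[5.0, 0.0], [0.0, 2.0]]
yc, tc = csne(U, B, [7.0, 1.0, 2.0])
```

Note that Q is never formed: only U and the data matrix B are needed, which is the property the downdating methods of this section rely on.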

The CSNE downdating method in [BPE92] is derived by using (27) in a least-squares
problem to approximate the first column of the identity matrix. In the rank-k case, the
CSNE will be used to solve k simultaneous least-squares problems approximating the first k
columns of the identity matrix as in (15).


Here $V \in \mathbb{R}^{n \times k}$ and $\phi$ is a length k row vector. Following the relation (27), it is now
straightforward to derive a block version of the CSNE downdating method.

Using  the  SNE,  k  systems  of  equations  are  obtained: 


Equation (29) can easily be broken down into two triangular systems of equations, the
first of which (30) turns out to be exactly equation (24). Therefore, $Q_{11}$ is solved for in
exactly the same manner as in the LINPACK method.


(  U  u 
V  0  p 


(30) 


The  residual  error  in  the  k  systems  of  equations  is  an  m  x  k  matrix  T , 


(31) 


each column of which represents the error in one system. By substituting this error back
into the same systems and solving again, correction factors for $Q_{11}$ and $(V^T\ \phi^T)^T$ may be
found.
The correction factor $\delta Q$ is added to $Q_{11}$, producing a corrected factor. The
quantity $(\delta V^T\ \delta\phi^T)^T$ is not actually used to update $(V^T\ \phi^T)^T$, as the latter quantity is not
important for our purposes. However, $\delta V$ and $\delta\phi$ are used to correct T,

=  *)(£) 

The  remaining  block  is  found  from  (18)  as  the  R  factor  in  the  QR  factorization  of  Tc. 

3.3  Hybrid  approach 

The rank-k CSNE algorithm has a much greater cost than the rank-k LINPACK algorithm in
terms of floating-point operations. Much of this cost may be attributed to the "refinement"
process, which may not be necessary if the matrix is well-conditioned. For this reason,
[BPE92] suggests a hybrid algorithm. In the rank-one hybrid downdating algorithm used by
[BPE92], the quantity $\gamma = 1 - \|Q_{11}\|^2$ (where $Q_{11}$ here is just a vector) is checked against
a user-specified tolerance. If $\gamma$ is greater than the tolerance, the downdating
problem is considered well-conditioned and the LINPACK downdating algorithm is used.
Otherwise, the CSNE algorithm is used. The suggested range of the tolerance is [0.25, 0.5]
[BPE92].
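
The selection rule of the rank-one hybrid algorithm can be sketched in a few lines. The function name and return values are hypothetical, and the sketch assumes that the tested quantity is one minus the squared 2-norm of the vector $Q_{11}$; the source text does not make the exact form of the norm unambiguous.

```python
def choose_downdating_method(q11, tol=0.25):
    """Return "LINPACK" for a well-conditioned downdate and "CSNE" otherwise.

    q11 is the vector Q11 of the rank-one case; tol is the user-specified
    tolerance, for which the text suggests a value in [0.25, 0.5].
    Assumption: gamma = 1 - ||q11||^2 with the squared 2-norm.
    """
    gamma = 1.0 - sum(x * x for x in q11)
    return "LINPACK" if gamma > tol else "CSNE"

print(choose_downdating_method([0.1, 0.2]))   # small norm: cheap method suffices
print(choose_downdating_method([0.9, 0.3]))   # norm close to one: use refinement
```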


4  Determining  the  orthogonal  reduction  factor  H 


This section discusses the problem expressed in (12), that is, developing and applying the
orthogonal reduction factor H to the augmented factorization obtained in (11) to produce
the downdated factors. Given that H has to be orthogonal, the requirements for H can be
more compactly represented as follows:


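\[
H^T \begin{pmatrix} R & Q_{11}^T \\ 0 & Q_{12}^T \end{pmatrix} =
\begin{pmatrix} U & u & 0 \\ 0 & \rho & 0 \\ Z & d & P^T \end{pmatrix}
\qquad (32)
\]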


Thus we must preserve the triangularity of R while zeroing out $Q_{11}^T$. Note that methods
for Gram-Schmidt factorizations and for Cholesky factorizations can make use of the same
reduction methods. The difference is whether Q is maintained at all. If so, H must also be
applied to $Q_{21}$ and $Q_{22}$ as in (12).


4.1  Givens  rotations  -  column  dominant  ordering 

We can construct the orthogonal reduction factor H in (32) using Givens rotations. The
order of the rotations preserves the triangular factor, while the values for each rotation are
chosen in such a way as to produce the desired reduction in $Q_{11}^T$.

The Givens rotations will act on pairs of rows of the target matrix. Let $G_{i,j}$ represent a
Givens rotation acting on the ith row of $(R\ Q_{11}^T)$ and the jth row of $(0\ Q_{12}^T)$. We can exploit
the fact that $Q_{12}^T$ is upper triangular and zero out $Q_{11}^T$ a column at a time in the following
way.

For the first column of $Q_{11}^T$, apply the product of Givens rotations, $G_{1,1}^T G_{2,1}^T \cdots G_{n+1,1}^T$, to
the target matrix from the left, with each rotation zeroing out an element of the first column
of $Q_{11}^T$ from bottom to top. Then we have the following result:

C'T  rr  f' 


T 

u  +  1,1 


R 


Qn 

r 


o  Qu 


n  +  1 

i 

k  -  I 
k - 1

0 

Qu 

Note that when the first column of $Q_{11}^T$ is zeroed out, the (1,1) element of $Q_{12}^T$ becomes 1
since $(Q_{11}\ Q_{12})$ has orthogonal columns. The remainder of the first row of $Q_{12}^T$ is zeroed out due to
the orthogonality property expressed in (11). R remains triangular.

This process is repeated for the remaining columns of $Q_{11}^T$, where for the ith column of
$Q_{11}^T$, the (i,i) element of $Q_{12}^T$ is used as the pivot. H is the cumulative product of the Givens
rotations. Incidentally, $P^T = I_k$.
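
For the rank-one case (k = 1) the column-at-a-time reduction can be written out directly. The following is a hedged sketch rather than the implementation tested in this paper: the function names are hypothetical, R is taken as a small square upper triangular matrix, the deleted row is z, and the vector q plays the role of $Q_{11}^T$ with the scalar alpha in the role of $Q_{12}^T$.

```python
import math

def solve_rt(R, z):
    """Solve R^T q = z by forward substitution (R upper triangular)."""
    n = len(R)
    q = [0.0] * n
    for i in range(n):
        q[i] = (z[i] - sum(R[p][i] * q[p] for p in range(i))) / R[i][i]
    return q

def givens_downdate(R, z):
    """Return (U, last) with U upper triangular and U^T U = R^T R - z z^T.

    Rotations act on row i of (R q) and on the extra row (0 alpha),
    zeroing q from bottom to top; 'last' ends up holding the removed row z.
    """
    n = len(R)
    q = solve_rt(R, z)
    gamma = 1.0 - sum(x * x for x in q)
    if gamma <= 0.0:
        raise ValueError("downdated matrix is not numerically positive definite")
    alpha = math.sqrt(gamma)
    U = [row[:] for row in R]
    last = [0.0] * n
    for i in range(n - 1, -1, -1):
        r = math.hypot(alpha, q[i])
        c, s = alpha / r, q[i] / r
        for j in range(i, n):          # columns left of i are zero in both rows
            ui, lj = U[i][j], last[j]
            U[i][j] = c * ui - s * lj
            last[j] = s * ui + c * lj
        alpha = r                      # pivot grows to 1 after the final rotation
    return U, last

U, last = givens_downdate([[2.0, 1.0], [0.0, 3.0]], [1.0, 1.0])
```

Because each rotation combines row i only with a row whose nonzeros lie to the right of column i, the triangular structure of R is preserved, which is the point made in the text.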


4.2  Givens  rotations  -  diagonal  dominant  ordering. 

The  ordering  of  Givens  rotations  to  perform  the  necessary  data  reduction  is  by  no  means 
unique.  To  show  this,  we  first  note  that  we  can  solve  (12)  differently,  that  is  we  could  just 
as  easily  reduce  down  to  the  first  k  columns  of  Q: 


Z  d) 

A  b  j 


h 

o 


(  Qn  Qu 

\  Q n  Qn 


h  IIT 


R  Qn 

o  Qn 


Our  new  requirement  is  now 
One way to do this with Givens rotations is to zero out $(Q_{11}\ Q_{12})^T$ a diagonal at a time,
starting with the main diagonal of $Q_{12}^T$ and working upwards. A total of k Givens rotations are needed
to zero each diagonal. Each rotation works on consecutive rows (say i and i + 1), and each
rotation in one diagonal operation works on consecutively higher numbered rows (i.e., the
first rotation works on (i, i + 1), the second on (i + 1, i + 2), etc.).

Thus the first step, zeroing the main diagonal of $Q_{12}^T$, involves Givens rotations working
on rows n + 1 through n + 1 + k of the target matrix. After application, we would have the
following result:


n  1  1  k  -  1 


QL  is  upper  triangular.  Incidentally,  note  that  p  is  found  after  only  k  Givens  rotations. 
This  process  is  repeated  for  the  proceeding  diagonals  until  Qj ,  is  upper  triangular.  The  new 
triangular  factor  and  downdated  rows  are  produced,  and  due  to  orthogonality,  PT  =  R. 


4.3  Row  Householder  transformations 

We return to the requirements for H specified in (32). We can also use a special form of the
Householder transformation to zero out $Q_{11}^T$. Using a standard column Householder transformation
will not preserve the triangularity of R. However, a row Householder transformation
[BNP92] can preserve the triangularity of R.

Each row Householder transformation will zero out one row of $Q_{11}^T$, working from the last
row of $Q_{11}^T$ to the first. The application of a row Householder transformation is as follows.
The transformation is of the same form as a standard Householder transformation, and is
applied to the row that is to be partially zeroed (say the first row of the target matrix,
where $q^T$ from $Q_{11}^T$ is to be zeroed) and to the rows that contain the residual bottom square
factor of the columns in which the zeroing takes place (here it is $Q_{12}^T$). The elements of the
Householder vector are determined by solving the system of equations $D^T p = \rho q$, where $\rho$ is
a normalization constant and $p$ is the Householder vector. After the first row Householder
transformation is applied,
The triangularity of R is preserved. The process is repeated until all rows of $Q_{11}^T$ have been
zeroed out.


5  Work  Analysis 

In this section we attempt to make a theoretical comparison of the amount of work involved
for the algorithms discussed in the previous sections. Table 1 compares each of the methods
used to obtain the needed portions of the necessary k new orthogonal columns. Here the
comparison is in terms of operations: multiplies, additions and subtractions, and divides
and square roots. The calculations are based on a rank-k downdate being performed on an
m x n data matrix (note that first order terms for multiplies and adds/subtracts have been
ignored). From Table 1, it can be seen that the most expensive methods appear to be CSNE,
and potentially CGS and the LINPACK-CSNE hybrid method in worst case scenarios. The
least expensive is easily LINPACK, with the remaining algorithms somewhere in between.


Total Operations

Algorithm     | Multiply                                    | Add, Subtract                                  | Div, Sqrt
CGSAUG (min)  | 2mnk + mk^2 + 2mk                           | 2mnk + mk^2 + 2mk + nk + 0.5k^2                | m + 2k
CGSAUG (max)  | 4mnk + 2mk^2 + 4mk                          | 4mnk + 2mk^2 + 4mk + 2nk + k^2                 | m + 4k
MGSAUG        | 2mnk + mk^2 + 2mk                           | 2mnk + mk^2 + 2mk                              | mk + k
SMALLQR       | mnk + mk^2 + mk                             | mnk + mk^2 + mk + k^2                          | mk + k
SMALLCHOL     | mnk + 0.5mk^2 - 1/3 k^3 + 1.5mk + 0.5k^2    | mnk + 0.5mk^2 - 1/3 k^3 + 0.5mk + 2k^2         | mk - 0.5k^2 + 3/2 k
RESQR         | 3mnk + mk^2 + 0.5n^2 k + 3mk + 0.5nk        | 3mnk + mk^2 + 0.5n^2 k + 5mk + 0.5nk           | mk + nk + 2k
LINPACK       | 0.5n^2 k + nk^2 + 1/6 k^3 + 0.5nk + k^2     | 0.5n^2 k + nk^2 + 1/6 k^3 + 0.5nk + 2.5k^2     | nk + 0.5k^2 + 2.5k
CSNE          | 3mnk + mk^2 + 2n^2 k + 3mk + 2nk            | 3mnk + mk^2 + 2n^2 k + 5mk + 2nk               | mk + 4nk + 5k
HYBRID (min)  | 0.5n^2 k + nk^2 + 1/6 k^3 + 0.5nk + k^2     | 0.5n^2 k + nk^2 + 1/6 k^3 + 0.5nk + 2.5k^2     | nk + 0.5k^2 + 2.5k
HYBRID (max)  | 3mnk + mk^2 + 2n^2 k + 3mk + 2nk            | 3mnk + mk^2 + 2n^2 k + 5mk + 2nk               | mk + 4nk + 5k

Table 1: Comparison of the work involved for each method via operation counts


If  one  were  to  implement  these  algorithms  on  a  microprocessor  that  had  a  BLAS  library 
available  for  it  (say,  for  example,  the  Intel  i860),  then  one  would  try  to  make  use  of  the  BLAS 
library  wherever  possible,  since  such  a  library  is  often  well-optimized  for  the  target  processor. 
If  the  processor  in  question  has  pipelining  or  vectorization  available,  operation  counts  may 
not  give  an  accurate  prediction  of  relative  execution  time.  Thus  we  also  provide  Table  2, 
which  gives  a  breakdown  of  components  of  each  algorithm  in  terms  of  BLAS  functions  and 
the  size  of  the  problem  each  call  solves. 


Algorithm 

j  BLAS-3 

1  BLAS-2  H 

GEMM 

TRSM 

GEM  V 

GER 

O (7)171  A ) 

Oink*) 

0(mn2) 

0(n*k) 

O(mn) 

0{mk) 

O(mk) 

CGSAUG  min 

2  k 

max 

4  k 

MGSAUG 

n  +  1 

n  +  1 

SMALLQR 

1 

SMALLCHOL 

1 

1 

RESQR 

1 

1 

Unpack 

1 

1 

CSNE 

3 

4 

HYBRID  min 

1 

1 

max 

3 

4 

Table  2:  Comparison  of  the  work  involved  for  each  method  in  terms  of  functions 


In Table 4 we compare the reduction methods in terms of operation counts. Note that
while the ordering of each of the Givens rotation sequences is different, the amount of work
for each is essentially the same.

Table 5 compares the total work involved for various methods. Here we examine Classical
Gram-Schmidt (maximum work case) with a Givens rotation method of reduction, and
CSNE, also with a Givens rotation reduction.
Algorithm

BLAS-l 

NfllvI2,A5CPY,DOT,SCAL 

O(k) 

CGSAUG  min 

3k 

k 

max 

4k 

2k 

MGSAUG  k 2 +  k 

SMALLQR  k2  +  k 

SMALLCHOL 

0  5k2  +  0.5k 

RESQR 

k2+k 

UNPACK 

0.5  k2  +  0.5k 

CSNE 

k2  +  k 

HYBRID  min 

0.5  k7  +  0.5k 

max 

k2  +  k 

Table  3:  Comparison  of  the  work  involved  for  each  method  in  terms  of  functions 


Total Operations

Algorithm     | Multiply                        | Add, Subtract                  | Div, Sqrt
GIVENS1 (GS)  | 4mnk + 2n^2 k + 4mk + 8nk       | 2mnk + n^2 k + 2mk + 4nk       | 3nk + 3k
GIVENS2 (GS)  | 4mnk + 2n^2 k + 4mk + 8nk       | 2mnk + n^2 k + 2mk + 4nk       | 3nk + 3k
GIVENS1 (R)   | 2n^2 k + 4nk^2 + 8nk + 4k^2     | n^2 k + 2nk^2 + 4nk + 2k^2     | 3nk + 3k
GIVENS2 (R)   | 2n^2 k + 4nk^2 + 8nk + 4k^2     | n^2 k + 2nk^2 + 4nk + 2k^2     | 3nk + 3k

Table 4: Comparison of the work involved for each reduction method via operation counts


We should also note that the storage requirements for each of the algorithms are essentially
the same. LINPACK and CSNE must both store R and (A b). The Gram-Schmidt
methods must maintain R and Q. This is the same amount of storage in both cases since
(A b) and Q are the same size. The residual QR method may be the most expensive in terms
of storage since it has to store both (A b) and Q (as well as R).


6  Numerical  Experiments 

The  methods  discussed  in  this  paper  were  implemented  and  tested  in  Pro-Matlab  (Version 
3.5i).  This  section  discusses  the  tests  and  matrices  used  to  compare  the  various  methods. 

The tests are all of the sliding window type. This type of test uses windows consisting
of w rows of an m x n matrix (m >> w > n). A series of least-squares problems, defined
by the window and the corresponding subsection of an m x 1 right-hand side vector, are
solved. Originally, the window is set to be the top w rows of the larger matrix, and the QR
factorization is computed. At each step, k rows of the larger matrix are added at the bottom


Algorithm 

Total  Operations  I 

Multiply 

Add,  Subtract 

Div,  Sqrt 

CGSAUG,  GIVENS 

CSNE,  GIVENS 

8m 7i A  4*  2u2A  4“  2??iA2  4-  8mA  4-  8nA 

3mnA4-4«2  A+3mA2  +4nfc2  —  mk+ 
lOnA  4-  4 A2 

6muA  4-  n2  k  4-  2m k2  4*  6mA  4-  GuA  + 

A2 

3»?mA4-3n2  A4*3mA2  4“2nA2  4-?uA4 
7nA  4*  2A2 

3  ?!  A  4-  7?  i  4-  7  A 

7?iA  4*  *xk 

Table  5:  Comparison  of  the  total  work  involved  for  some  methods 
of the window, requiring an update of the QR factorization, and k rows are deleted from
the top of the window, requiring a downdate of the factorization. The updated/downdated
factorization is then used to obtain the solution to the problem corresponding to the current
window position. Our tests used a window size w = 8.
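
The bookkeeping of the sliding window test can be sketched as follows. This only enumerates which rows enter and leave the window at each step; the generator name and its 0-based row indices are illustrative assumptions, and the actual update and downdate of the factorization are left out.

```python
def sliding_windows(m, w, k):
    """Yield (start, stop, added, deleted) for each window position.

    The window covers rows start..stop-1 (0-based) of an m-row matrix.
    'added' lists the rows appended at the bottom (an update) and
    'deleted' the rows removed from the top (a downdate) at that step.
    """
    start = 0
    yield (0, w, list(range(w)), [])        # initial window: full factorization
    while start + w + k <= m:
        added = list(range(start + w, start + w + k))
        deleted = list(range(start, start + k))
        start += k
        yield (start, start + w, added, deleted)

# Window size w = 8 as in the tests, rank-two steps on a 12-row matrix.
for step in sliding_windows(12, 8, 2):
    print(step)
```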

Two of the three matrices used in the tests come from [BPE92]; the third is adapted
from [Bjo87].

Matrix  I.  The  matrix  A  is  of  size  50  x  5  with  elements  taken  from  a  uniform  probability 
distribution  in  [0,1].  Element  (18,3)  has  been  perturbed  by  a  uniform  random  sample 
from [0, 10^3]. The right-hand side vector b is constructed by multiplying A by the
vector  [1, 1, 1, 1, 1]T  and  adding  to  each  element  a  random  sample  from  [0, 10"°]. 

Matrix II. The matrix A is again of size 50 x 5. In the first 25 rows of A, element (i,j)
is 1/(i + j - 1); these rows are the first five columns of a 25 x 25 Hilbert matrix. In the
bottom 25 rows, element (i,j) is the same as element (51 - i,j), that is, the bottom
25 rows are the reflection of the top 25 rows about the middle of the matrix. Each
element of A is perturbed by a uniform random sample from [0, 10^-5]. The right-hand
side vector b is again the product of A and the vector [1, 1, 1, 1, 1]^T, with each element
perturbed by a uniform random sample from [0, 1].
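
A construction of the data matrix of Matrix II might look like the sketch below. The function name and the `perturb` and `seed` parameters are hypothetical conveniences; the Hilbert entries use the standard definition 1/(i + j - 1) with 1-based indices, and the right-hand side is not built here.

```python
import random

def matrix_ii(perturb=True, seed=0):
    """Return the 50 x 5 test matrix as a list of rows.

    Rows 1..25 are the first five columns of a 25 x 25 Hilbert matrix and
    rows 26..50 mirror them, so row i equals row 51 - i (1-based).
    """
    rng = random.Random(seed)
    top = [[1.0 / (i + j - 1) for j in range(1, 6)] for i in range(1, 26)]
    A = [row[:] for row in top + top[::-1]]   # copy so mirrored rows are independent
    if perturb:
        for row in A:
            for j in range(5):
                row[j] += rng.uniform(0.0, 1e-5)
    return A

A = matrix_ii(perturb=False)
print(len(A), len(A[0]), A[0][:3])
```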

Matrix  III.  The  matrix  A,  again  50  x  5,  is  the  product  of  three  matrices,  W,  V ,  and  D. 
V  is  the  50  x  5  matrix  in  which  element  (i,j)  has  the  value  (i  —  D  is  the 

diagonal matrix which normalizes each column of V. W is a matrix which weights
rows 15, 20, 25, ..., 50 by a factor of 100. The right-hand side vector b is constructed
by multiplying A by a vector x = [10^4, 10^3, 100, 10, 1]^T.

Figures  1,  2,  and  3  show  data  about  the  condition  of  the  three  matrices.  Each  figure 
shows  the  condition  of  the  window  matrix  for  each  step  of  the  sliding  window  process. 

The remaining figures show the performance of the methods on the three test matrices.
Two types of comparisons are made. First, the rank-k CSNE and LINPACK methods
are compared to the rank-one methods presented in [BPE92]. Second, the various rank-k
methods are compared to each other. The Gram-Schmidt methods of Section 2 and the
CSNE/LINPACK methods of Section 3 are compared as separate groups, and then the best
methods from each group are compared to each other.

First, consider the rank-k CSNE and LINPACK methods. Bjorck, Park and Elden
[BPE92] give results which show that in the rank-one case, for tests involving ill-conditioned
matrices, the CSNE method outperforms the LINPACK method. Figures 4 and 5 show
that this continues to be the case for the rank-k methods. In addition, the rank-two and
rank-three methods perform at least as well as the rank-one methods for the tests presented
here. It is particularly interesting to note that the error in the LINPACK method for Matrix
I goes down by orders of magnitude as the rank increases; the rank-two CSNE method is
also much better than the rank-one CSNE method on this matrix, although the rank-three
CSNE method does not improve much on the accuracy of the rank-two method.


Figure 1: Condition of the Window Matrix - Matrix I


Figure 2: Condition of the Window Matrix - Matrix II


Figure 3: Condition of the Window Matrix - Matrix III

Figures 6 and 7 compare the five Gram-Schmidt based methods presented in this paper
(Classical GS, Modified GS, the Small QR and Small Cholesky factorizations, and the Residual
QR method). The methods are comparable on Matrix I. On Matrix II, the two "Small"
factorization methods do not perform nearly as well as the others (the QR method seems to
be better than the Cholesky method, as predicted). The other methods are comparable to
one another and are in the same range as the CSNE method. However, it should be noted
that the CGS method required one re-orthogonalization for all but one of the columns of
Matrix II.

A comparison of the rank-two CSNE, MGS, and CGS methods (Figure 8) shows that these
methods perform comparably on Matrix II, giving results similar to a QR decomposition.
However, as noted in [Bjo87], the CSNE method has problems when dealing with matrices
which are "weighted", that is, in which rows have been multiplied by a weight constant which
gives one row a significantly higher norm than others around it. Matrix III was chosen
because it is an example of an ill-conditioned, weighted matrix. Figure 9 shows that the
CSNE method breaks down when the window includes the first weighted row. The Residual
QR method also breaks down, but the CGS and MGS methods both closely approximate
the results obtained by performing a full QR decomposition. The CGS method required
re-orthogonalization for each column of Matrix III. From these figures, we conclude that the
CGS method with re-orthogonalization is the most stable, although it is expensive in terms
of computation and storage. The MGS method is less expensive and performs comparably
in all the cases shown here. The CSNE method performs nearly as well in some cases with
lower storage and computation costs.
Figure 6: Rank-two Gram-Schmidt Methods, Matrix I


Figure 7: Rank-two Gram-Schmidt Methods, Matrix II
Abstract

In this note, we propose an implicit method for applying orthogonal transformations on both
sides of a product of upper triangular 2 x 2 matrices that preserve upper triangularity of the
factors. Such problems arise in Jacobi type methods for computing the PSVD of a product of
several matrices, and in ordering eigenvalues in the periodic Schur decomposition.

Introduction 

The problem of computing the singular value decomposition (SVD) of a product of matrices has
been considered in [1], [2], [3], [10]. The computation proceeds in two stages. In the first stage the
matrices are transformed into upper triangular form. In the second, iterative stage an implicit
Jacobi-type method is applied to the triangular matrices. It is important that after each iteration
the matrices stay triangular [8].

A crucial aspect in such implicit Jacobi iterations is the accurate computation of the PSVD of
a product of 2 x 2 triangular matrices. There, two conditions have to be satisfied [2]. First, one
has to ensure that the orthogonal transformations applied to the triangular matrices leave
the matrices triangular, and second, that the transformations diagonalize the product accurately.
It was shown in [1] and [2] that these two conditions are satisfied by a so-called half-recursive and
direct method, respectively, for computing the SVD of the product of two matrices.

In  this  note  we  analyze  an  extension  of  the  half-recursive  method  for  computing  the  SVD  of  the 
product  of  many  2x2  triangular  matrices.  We  also  show  that  the  extension  of  the  half-recursive 
method  can  be  used  for  swapping  eigenvalues  in  the  periodic  Schur  decomposition  described  in  [4]. 
For  simplicity  we  assume  real  matrices  and  real  eigenvalues,  but  all  results  are  easily  extended  to 
the  complex  case. 
Criterion for numerical triangularity

Suppose we are given k, k > 1, upper triangular matrices $A_i$, $i = 1, 2, \ldots, k$,

\[ A_i = \begin{pmatrix} a_i & b_i \\ 0 & d_i \end{pmatrix}. \]

We denote the product of $A_i$, $i = 1, 2, \ldots, k$, by A,

\[ A = A_1 A_2 \cdots A_k = \begin{pmatrix} a & b \\ 0 & d \end{pmatrix}. \]

Let the orthogonal matrices $Q_1$ and $Q_{k+1}$ be such that

\[ A' = Q_1 A Q_{k+1}^T = \begin{pmatrix} a' & b' \\ 0 & d' \end{pmatrix} \]

is upper triangular. In case we are interested in finding the Singular Value Decomposition of A, one
imposes the additional condition that b' = 0. This defines uniquely the above decomposition up
to permutations that interchange the diagonal elements of A'. In case we are interested in finding
the Schur Form of A, one imposes the additional condition that $Q_1 = Q_{k+1}$. Again, this defines
uniquely the above decomposition up to the ordering of the diagonal elements of A'. In both cases
the transformations $Q_1$ and $Q_{k+1}$ are thus defined by the choice of ordering of diagonal elements in
the resulting matrix A'. Our objective now is to find orthogonal matrices $Q_j$, $j = 2, 3, \ldots, k$, such
that

\[ A_i' = Q_i A_i Q_{i+1}^T = \begin{pmatrix} a_i' & b_i' \\ 0 & d_i' \end{pmatrix} \qquad (2.2) \]

are meanwhile maintained in upper triangular form as well. It is easy to see that if $abd \neq 0$ then for a
given pair of orthogonal transformations $Q_1$ and $Q_{k+1}$ there exist unique (up to the sign) orthogonal
transformations $Q_2, \ldots, Q_k$ such that (2.2) is satisfied. There are many mathematically equivalent
strategies of determining $Q_2, \ldots, Q_k$. However, as it was shown in [1], [2] and [3], some strategies
may produce numerically significantly different results than other strategies. We will consider a
particular method numerically acceptable if the triangular matrices after transformations have been
applied to them stay numerically triangular in the sense described below.

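Let $\bar A$ be the computed A, and let $\bar Q_i$, $i = 1, 2, \ldots, k+1$, be the computed transformations.
Define

\[ \bar A' := \bar Q_1 \bar A \bar Q_{k+1}^T = \begin{pmatrix} \bar a' & \bar b' \\ \bar e' & \bar d' \end{pmatrix} \qquad (2.3) \]

\[ \bar A_i' := \bar Q_i A_i \bar Q_{i+1}^T = \begin{pmatrix} \bar a_i' & \bar b_i' \\ \bar e_i' & \bar d_i' \end{pmatrix}. \qquad (2.4) \]

Let $\epsilon$ denote the relative machine precision. Assume that we are given $\bar Q_1$ and $\bar Q_{k+1}$ such that

\[ |\bar e'| = O(\epsilon \|A\|). \qquad (2.5a) \]

We will say that $\bar A_i'$ is numerically triangular if

\[ |\bar e_i'| = O(\epsilon \|A_i\|). \qquad (2.5b) \]

We will propose a method for computing nearly orthogonal $\bar Q_i$, $i = 2, \ldots, k$, for which, under a
slightly stronger version of the assumption (2.5a), the (2,1) element $\bar e_i'$ of $\bar A_i'$ will satisfy (2.5b).
Condition (2.5b) justifies truncating the (2,1) element $\bar e_i'$ of $\bar A_i'$ to zero. Thus, $\bar e'$ is also forced to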
The Algorithm

Our algorithm is a generalization of the algorithms presented in [1] and [3] for computing the PSVD
of two and three matrices respectively. There the orthogonal transformations all had the form

\[ Q = \begin{pmatrix} s & c \\ -c & s \end{pmatrix} \qquad (3.1) \]

where $c^2 + s^2 = 1$. As we will build on the results presented in those papers we retain this particular
choice of orthogonal transformations. While each transformation $Q_i$ is defined by the cosine-sine
pair $c_i = \cos\theta_i$ and $s_i = \sin\theta_i$, we also associate $Q_i$ with the tangent

\[ t_i = \tan\theta_i. \]

Given $t_i$, we can easily recover $c_i$ and $s_i$ using the relations

\[ c_i = \frac{1}{\sqrt{1 + t_i^2}} \quad \text{and} \quad s_i = t_i c_i. \qquad (3.2) \]
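
The recovery of the cosine-sine pair from a tangent is short enough to write out. The function name is hypothetical, and the use of `math.hypot` in place of an explicit square root of $1 + t^2$ is an implementation choice made here to avoid overflow for very large tangents; it is not something the text prescribes.

```python
import math

def cos_sin_from_tangent(t):
    """Return (c, s) with c = 1 / sqrt(1 + t^2) and s = t * c."""
    c = 1.0 / math.hypot(1.0, t)
    return c, t * c

c, s = cos_sin_from_tangent(1.0)
print(c, s)   # both are close to 1 / sqrt(2)
```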

Following  the  exposition  in  [1],  [3],  we  consider  the  result  of  applying  the  left  and  right  transforma¬ 
tions  Qi  (for  the  outer  left  transformation)  and  Qr  (for  the  outer  right  transformation)  to  a  2  x  2 
upper  triangular  matrix  A: 


A'^QiAQj^ 

b'\  _  /  si  c/Wa  b\  /  sr  cTy 
d! )  \-C,  Sl)\  0  d)\-cr  Sr) 

(3.3) 

We can derive from (3.3) these four relations:

$$e' = c_l c_r (-a t_r + d t_l - b) , \qquad (3.4a)$$

$$b' = c_l c_r (-a t_l + d t_r + b t_l t_r) , \qquad (3.4b)$$

$$a' = c_l c_r (b t_l + d + a t_l t_r) , \qquad (3.4c)$$

$$d' = c_l c_r (a - b t_r + d t_l t_r) , \qquad (3.4d)$$

where $t_l = \tan\theta_l$ and $t_r = \tan\theta_r$.
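The four relations can be checked numerically by forming the product in (3.3) explicitly and comparing it with the closed-form expressions. The sketch below does this for arbitrary test data; the helper names and the sample values are assumptions for illustration and do not come from the source.

```python
import math

def rotation(t):
    """Return Q = [[s, c], [-c, s]] of the form (3.1) and its cosine, given t = s/c."""
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = t * c
    return [[s, c], [-c, s]], c

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

# Arbitrary test data (assumed values, not taken from the paper).
a, b, d = 2.0, -1.5, 0.5
t_l, t_r = 0.3, -1.2

Q_l, c_l = rotation(t_l)
Q_r, c_r = rotation(t_r)
A_new = matmul(matmul(Q_l, [[a, b], [0.0, d]]), transpose(Q_r))

scale = c_l * c_r
closed_form = [
    [scale * (b * t_l + d + a * t_l * t_r),      # a', relation (3.4c)
     scale * (-a * t_l + d * t_r + b * t_l * t_r)],  # b', relation (3.4b)
    [scale * (-a * t_r + d * t_l - b),           # e', relation (3.4a)
     scale * (a - b * t_r + d * t_l * t_r)],     # d', relation (3.4d)
]
max_diff = max(abs(A_new[i][j] - closed_form[i][j])
               for i in range(2) for j in range(2))
print(max_diff)
```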

The postulates that both $e'$ and $b'$ be zero define two conditions on $t_l$ and $t_r$, so that (3.3)
represents an SVD of $A$ [5]. The postulates that $e'$ be zero and $t_l = t_r$ represent conditions for
swapping eigenvalues of $A$.

The postulate that $e'$ be zero defines a condition relating $\theta_l$ to $\theta_r$, so that if one is known the
other can be computed in order to reduce $A'$ to an upper triangular form. For ease of exposition,
we assume from now on that $abd \neq 0$. It implies that $c_l c_r \neq 0$, and so the postulate that $e' = 0$ in
(3.4a) becomes

$$-a t_r + d t_l - b = 0 . \qquad (3.5)$$

The consequence of (3.5) is that (3.4c) and (3.4d) simplify to

$$a' = c_l c_r (t_l^2 + 1) d \qquad (3.6a)$$

and

$$d' = c_l c_r (t_r^2 + 1) a , \qquad (3.6b)$$

respectively.
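To illustrate how one tangent determines the other, the sketch below solves (3.5) for $t_l$ given $t_r$ and then confirms that $e'$ vanishes and that $a'$ and $d'$ take the simplified forms. The function names and numerical values are assumptions chosen for this example.

```python
import math

def cos_of(t):
    return 1.0 / math.sqrt(1.0 + t * t)

def left_tangent(a, b, d, t_r):
    """Solve (3.5), -a*t_r + d*t_l - b = 0, for t_l (requires d != 0)."""
    return (a * t_r + b) / d

# Assumed sample data with a*b*d != 0.
a, b, d = 3.0, 1.0, 2.0
t_r = 0.5
t_l = left_tangent(a, b, d, t_r)

cc = cos_of(t_l) * cos_of(t_r)
e_new = cc * (-a * t_r + d * t_l - b)          # (3.4a)
a_new = cc * (b * t_l + d + a * t_l * t_r)     # (3.4c)
d_new = cc * (a - b * t_r + d * t_l * t_r)     # (3.4d)
a_simplified = cc * (t_l ** 2 + 1.0) * d       # (3.6a)
d_simplified = cc * (t_r ** 2 + 1.0) * a       # (3.6b)
print(t_l, e_new, a_new - a_simplified, d_new - d_simplified)
```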

Assume that $Q_l = Q_1$ and $Q_r = Q_{k+1}$ are given, that is, $t_l = t_1$ and $t_r = t_{k+1}$ are known. We
will use relations of the type (3.5) with $t_l$ and $t_r$ as the reference tangents to compute the remaining
transformations.

Our algorithm can be described recursively as follows. We split the sequence $A_1, A_2, \ldots, A_k$
into two subsequences of consecutive matrices $A_1, \ldots, A_{m-1}$ and $A_m, \ldots, A_k$, where
$1 < m < k+1$. Let us denote

$$A_l = \begin{pmatrix} a_l & b_l \\ 0 & d_l \end{pmatrix} = \prod_{i=1}^{m-1} A_i \quad \text{and} \quad A_r = \begin{pmatrix} a_r & b_r \\ 0 & d_r \end{pmatrix} = \prod_{i=m}^{k} A_i . \qquad (3.7)$$

Suppose that

$$|t_l d| \le |t_r a| .$$

Then we propose to compute $t_m$ from the condition (3.5) by the forward substitution,

$$t_m = \frac{d_l t_l - b_l}{a_l} . \qquad (3.8a)$$

Otherwise, that is when

$$|t_l d| > |t_r a| ,$$

we propose to compute $t_m$ from (3.5) by the backward substitution,

$$t_m = \frac{a_r t_r + b_r}{d_r} . \qquad (3.8b)$$

Having defined the first step, the procedure can now be applied recursively to generate all
the remaining orthogonal transformations $Q_i$, $i = 2, \ldots, k$. Note that there is a lot of freedom in
splitting the sequence $A_1, \ldots, A_k$ into subsequent subsequences. This might be advantageous
for a divide-and-conquer type of computation in a parallel environment.

As will be shown later, under mild conditions on $Q_1$ and $Q_{k+1}$, this particular way of generating
orthogonal transformations $Q_i$, $i = 2, \ldots, k$, will guarantee that all $A_i'$ will be numerically upper
triangular in the sense that (2.5b) will be satisfied.
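The recursion can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each factor is stored as a tuple `(a, b, d)`, the split point is taken at the middle of the sequence (the text leaves this choice free), and the test data and outer tangents are invented so that (3.5) holds for the full product.

```python
import math

def product(mats):
    """Product of 2x2 upper triangular matrices, each stored as (a, b, d)."""
    a, b, d = 1.0, 0.0, 1.0
    for a2, b2, d2 in mats:
        a, b, d = a * a2, a * b2 + b * d2, d * d2
    return a, b, d

def inner_tangents(mats, t_l, t_r):
    """Return the tangents between consecutive factors, given the outer pair."""
    k = len(mats)
    if k < 2:
        return []
    m = k // 2                          # assumed split rule: halve the sequence
    a_l, b_l, d_l = product(mats[:m])   # A_l of (3.7)
    a_r, b_r, d_r = product(mats[m:])   # A_r of (3.7)
    a, d = a_l * a_r, d_l * d_r         # diagonal of the full product
    if abs(t_l * d) <= abs(t_r * a):
        t_m = (d_l * t_l - b_l) / a_l   # forward substitution (3.8a)
    else:
        t_m = (a_r * t_r + b_r) / d_r   # backward substitution (3.8b)
    return (inner_tangents(mats[:m], t_l, t_m) + [t_m]
            + inner_tangents(mats[m:], t_m, t_r))

def e_prime(mat, t_left, t_right):
    """(2,1) element of the transformed factor, from (3.4a)."""
    a, b, d = mat
    scale = 1.0 / math.sqrt((1.0 + t_left ** 2) * (1.0 + t_right ** 2))
    return scale * (-a * t_right + d * t_left - b)

# Assumed, well-conditioned test data.
mats = [(2.0, 1.0, 3.0), (1.0, -0.5, 2.0), (4.0, 0.25, 1.0)]
a, b, d = product(mats)
t_r = 0.7
t_l = (a * t_r + b) / d                 # outer pair chosen to satisfy (3.5)

ts = [t_l] + inner_tangents(mats, t_l, t_r) + [t_r]
residuals = [e_prime(A, ts[i], ts[i + 1]) for i, A in enumerate(mats)]
print(ts)
print(residuals)
```

Since the two recursive calls are independent once $t_m$ is known, they could in principle run in parallel, which is the divide-and-conquer opportunity mentioned in the text.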


Error  Analysis 

In our error analysis, we adopt a convention that involves a liberal use of Greek letters. For example,
by $\alpha$ we mean a relative perturbation of an absolute magnitude not greater than $\epsilon$, where $\epsilon$ denotes
the machine precision. All terms of order $\epsilon^2$ or higher will be ignored in this first-order analysis.

The function $\mathrm{fl}(a)$ will denote the floating point approximation of $a$. For the purpose of the
analysis, a "bar" denotes a computed quantity which is perturbed as the result of inexact arithmetic.
For example, instead of $a$, $b$ and $d$, we have the perturbed values $\bar{a}$, $\bar{b}$ and $\bar{d}$ which result from floating
point computation of $\prod_{i=1}^{k} A_i$. We assume that exact arithmetic may be performed using the
perturbed values. The "tilde" symbol is used to denote conceptual values computed exactly from
perturbed data.

We start our procedure by computing elements of the product matrix $A$ as the product of $A_l$
and $A_r$ defined by (3.7):

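$$\bar{a} := \mathrm{fl}(a_l a_r) = a_l a_r (1 + \alpha) , \qquad (4.1a)$$

$$\bar{d} := \mathrm{fl}(d_l d_r) = d_l d_r (1 + \delta) , \qquad (4.1b)$$

$$\bar{b} := \mathrm{fl}(a_l b_r + b_l d_r) = a_l b_r (1 + 2\beta_1) + b_l d_r (1 + 2\beta_2) , \qquad (4.1c)$$

where, according to our convention, the parameters $\alpha$, $\delta$, $\beta_1$, and $\beta_2$ are all quantities whose
absolute values are bounded by $\epsilon$.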
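The rounding model behind (4.1a)-(4.1c) can be checked with exact rational arithmetic. The sketch below assumes IEEE double precision with round-to-nearest, so that $\epsilon = 2^{-53}$, and reuses entries of the data matrices from the numerical example as sample inputs; the variable names are assumptions. For $\bar{b}$ the first-order statement (4.1c) is tested in the slightly weaker exact form $|\bar{b} - b| \le (2\epsilon + \epsilon^2)(|a_l b_r| + |b_l d_r|)$.

```python
from fractions import Fraction

U = Fraction(1, 2 ** 53)   # unit roundoff, IEEE double, round-to-nearest (assumed)

a_l, b_l, d_l = 2.316797292247488, -0.1437687878748196, -0.02718295063593277
a_r, b_r, d_r = 1.222222234444442, 0.3480474357220011, 5.674165405829751

def exact(x):
    """Exact rational value of a floating point number."""
    return Fraction(x)

# Relative perturbations alpha and delta of (4.1a) and (4.1b).
alpha = (exact(a_l * a_r) - exact(a_l) * exact(a_r)) / (exact(a_l) * exact(a_r))
delta = (exact(d_l * d_r) - exact(d_l) * exact(d_r)) / (exact(d_l) * exact(d_r))

# Absolute error of the computed b and the bound implied by (4.1c).
p1 = exact(a_l) * exact(b_r)
p2 = exact(b_l) * exact(d_r)
b_err = abs(exact(a_l * b_r + b_l * d_r) - (p1 + p2))
b_bound = (abs(p1) + abs(p2)) * (2 * U + U * U)

print(float(alpha), float(delta), float(b_err), float(b_bound))
```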

Now we specify the condition that we impose on the computed $Q_1$ and $Q_{k+1}$.


Assumption I: Throughout the rest of this note we will assume that the computed tangents $t_l$ and $t_r$
corresponding to the outer transformations $Q_l = Q_1$ and $Q_r = Q_{k+1}$ satisfy the following equality

$$\bar{a}(1 + C\psi) t_r - \bar{d}(1 + C\phi) t_l + \bar{b}(1 + C\chi) = 0 , \qquad (4.2a)$$

where $C = C(k)$.


□ 


Lemma 4.1: The recurrence (3.8a) yields $t_m$ such that

$$a_l(1 + 2\psi_1) t_m - d_l(1 + \phi_1) t_l + b_l = 0 . \qquad (4.3)$$

Likewise, the recurrence (3.8b) yields $t_m$ such that

$$d_r(1 + \phi_2) t_m - a_r(1 + 2\psi_2) t_r - b_r = 0 . \qquad (4.4)$$


□ 

Proof.  The  proof  easily  follows  from  (3.8a)  and  (3.8b). 

□ 

Theorem 4.2: If $|t_l d| \le |t_r a|$ and if $t_m$ is computed via (3.8a) then $t_m$ satisfies the relation

$$a_r(1 + C_l\psi_l) t_r - d_r(1 + C_l\phi_l) t_m + b_r(1 + C_l\chi_l) = 0 \qquad (4.5a)$$

where $C_l = C_l(k)$. Likewise, if $|t_l d| > |t_r a|$ and if $t_m$ is computed via (3.8b) then $t_m$ satisfies the
relation

$$a_l(1 + C_r\psi_r) t_m - d_l(1 + C_r\phi_r) t_l + b_l(1 + C_r\chi_r) = 0 \qquad (4.5b)$$

where $C_r = C_r(k)$.

Proof. We give a proof of the relation (4.5a) only, as the relation (4.5b) can be proved in an
analogous way.

First from (4.3a)-(4.3b) we get

$$a_l(1 + 2\psi_1) t_m - d_l(1 + \phi_1) t_l + b_l = 0 , \qquad (4.6a)$$

while from Assumption I and (4.1a)-(4.1c) we have

$$a_l a_r(1 + \alpha + C\psi) t_r - d_l d_r(1 + \delta + C\phi) t_l + a_l b_r(1 + 2\beta_1 + C\chi) + b_l d_r(1 + 2\beta_2 + C\chi) = 0 . \qquad (4.6b)$$

we obtain


or,  since  a;  ^  0, 


ai{aT(  1  +  a  +  Ct/0*r  -  ar  r—  ](6  +  C<p-  4> i  -  202  -  C\)f(  + 

\a/a  r  / 

6r(  1  +  2/3j  +  C'x)  —  dT(l  +  202  +  Cx  +  2V'i)*m }  =  0  , 


ar{  1  +  o  +  Cil>)ir  —  dr£r  {  1  (<5  4-  C4>  ~  0i  4-  2/3-2  +  C  \)+ 


6r(l  +  2/3j  +  Cx)  —  ^r(l  +  202  +  2y,,j  +  C\)tm  —  0  . 
As we assumed that $|t_l d| \le |t_r a|$, the above can be rewritten as

$$a_r(1 + C_l\psi_l) t_r - d_r(1 + C_l\phi_l) t_m + b_r(1 + C_l\chi_l) = 0$$

where $C_l = C_l(k)$, completing the proof.


We now justify why the (2,1) element in the computed matrix $\bar{A}_i'$ can be set to zero. Let the
cosine and sine pairs $c_i$ and $s_i$ satisfy $t_i = s_i / c_i$, for $i = l, m, r$. From (4.2) we can derive that


c,  fl(c,)  =  c,(l  +  3/i.)  , 


(l.Sa) 


h  :=fl(s,)  =  s,(l  +4//,).  (4.8b) 

Let  A\  denote  the  exact  updated  matrix  derived  from  A,,  i  =  /,r,  and  c,,  5,,  i  =  /,  m,  r  that  is 


S|  C/ 
-C;  5/ 


«/  &/ 

0  rf, 


ur  \  /  5r  cr 
0  dr  )  \  CT  Sr 


Our next result is a direct consequence of Theorem 4.2 and provides bounds on the elements
$e_i'$, $i = l, r$, defined by the relations

$$e_l' := -c_l s_m a_l + s_l c_m d_l - c_l c_m b_l , \qquad (4.10a)$$

$$e_r' := -c_m s_r a_r + s_m c_r d_r - c_m c_r b_r . \qquad (4.10b)$$


Corollary 4.4: If $|t_l d| \le |t_r a|$ and if $t_m$ is computed via (3.8a), or if $|t_l d| > |t_r a|$ and if $t_m$ is
computed via (3.8b), then

$$|e_i'| \le K_i \epsilon \|A_i\| , \quad \text{for } i = l, r. \qquad (4.11)$$

Proof. We prove the corollary for the case when $|t_l d| \le |t_r a|$ and when $t_m$ is computed via
(3.8a). The other case can be proved in an analogous manner.

Using  (4.3a)  we  can  rewrite  (4.10a)  as 

€[  —  cisinQ>i  4~  sicmdi  e/Cj^t/"!" 

c/cm(d/(l  +  2t/’j )tm  -  J,(l  +  d>])i)  +  bi)  ( 4 . 12 j 

from  which  it  follows  that 

131  <  A>P,II  • 

Similarly,  using  (4.5a)  we  can  rewrite  (4.10b)  as 


e 


/ 

r 


Cm^r^r  4"  CrCr6r  + 


c/cm(dr(l  +  Ciin)tT  -  dT(  1  +  C[4>i)tm  +  ir(l  +  CWr)) 


and  thus 


completing  the  proof  of  (4.10a). 


131  <  AVI  Mr  1 1  , 

□ 


(4.1 3) 


Numerical  examples 


The SVD algorithms for $2 \times 2$ upper triangular matrices in [1], [2] or [5] give $t_l$ and $t_r$ which satisfy
Assumption I. We will illustrate that by using our new scheme triangularity of the transformed
factors is preserved.

Consider the case of three matrices in the product. Assume that the given data matrices are

$$A_1 = \begin{pmatrix} 2.316797292247488e+00 & -1.437687878748196e-01 \\ 0 & -2.718295063593277e-02 \end{pmatrix} ,$$

$$A_2 = \begin{pmatrix} 1.222222234444442e+00 & 3.480474357220011e-01 \\ 0 & 5.674165405829751e+00 \end{pmatrix} ,$$

$$A_3 = \begin{pmatrix} 2.222222211111111e-01 & 1.732050807568877e+00 \\ 0 & 1.111111110000000e-12 \end{pmatrix} .$$

They generate the matrix product $A = A_1 \cdot A_2 \cdot A_3$:

$$A = \begin{pmatrix} 6.292535886949669e-01 & 4.904546363614013e+00 \\ 0 & -1.713783977472744e-13 \end{pmatrix} .$$

We are interested in computing orthogonal transformations $Q_1$, $Q_2$, $Q_3$ and $Q_4$ which satisfy (2.2)
and  (2.  ith  the  (1,2)  element  zero.  The  SVD  algorithm  for  the  2x2  upper  triangular  matrix  A 
in [1] or [5] gives $t_1 = 3.437688760727056e-14$ and $t_4 = -7.794228673031074e+00$ which satisfy
Assumption I. In fact we have

$$Q_1 A Q_4^T = \begin{pmatrix} -2.180909253067911e-14 & -7.494178599599612e-30 \\ 0 & 4.944748235423613e+00 \end{pmatrix} .$$

We split $A$ into the product of $A_{12} = A_1 A_2$ and $A_3$. We note that the ratio

$$\frac{|t_1 d|}{|t_4 a|} = 1.201223412093697e-27 .$$

If we compute $t_3$ from $t_1$ as indicated by the ratio, and next $t_2$ as specified by (3.8a) or (3.8b),
then Corollary 4.4 will guarantee that the transformed factors will stay (numerically) triangular.
Suppose however that we compute $t_3$ from $t_4$ and next $t_2$ from $t_3$. Then Lemma 4.1 will guarantee
that $Q_2 A_2 Q_3^T$ and $Q_3 A_3 Q_4^T$ will stay numerically triangular. However, for the computed $Q_1 A_1 Q_2^T$,
we have

O  A  Ot  -  (  ~2  713066430028558e  ~  02  —1.68518838740240 lr  -  03  \ 

36010694 1575845* -04  2.32125378604G10G«' +  00  J 

which cannot be considered upper triangular. An error of order $10^{-4}$ has to be introduced to
truncate the (2,1) element in $Q_1 A_1 Q_2^T$ so it becomes upper triangular.
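The experiment can be repeated with a few lines of code. The sketch below is an illustration only: it assumes IEEE double precision, uses the data and outer tangents as printed above (the 16-digit decimals may not reproduce the authors' binary values exactly), and evaluates the (2,1) elements through (3.4a). The helper names are assumptions. On such arithmetic one would expect the forward ordering to leave (2,1) entries near the rounding level in the first two factors, and the backward ordering to leave a much larger entry in the first factor; the exact figures depend on rounding details.

```python
import math

# Factors stored as (a, b, d) for the upper triangular matrix [[a, b], [0, d]].
A1 = (2.316797292247488e+00, -1.437687878748196e-01, -2.718295063593277e-02)
A2 = (1.222222234444442e+00, 3.480474357220011e-01, 5.674165405829751e+00)
A3 = (2.222222211111111e-01, 1.732050807568877e+00, 1.111111110000000e-12)
t1, t4 = 3.437688760727056e-14, -7.794228673031074e+00

def mul(X, Y):
    """Product of two upper triangular 2x2 matrices."""
    return (X[0] * Y[0], X[0] * Y[1] + X[1] * Y[2], X[2] * Y[2])

def forward(A, t_left):
    """Forward substitution (3.8a): right tangent from the left one."""
    a, b, d = A
    return (d * t_left - b) / a

def backward(A, t_right):
    """Backward substitution (3.8b): left tangent from the right one."""
    a, b, d = A
    return (a * t_right + b) / d

def e_prime(A, t_left, t_right):
    """(2,1) element of the transformed factor, from (3.4a)."""
    a, b, d = A
    scale = 1.0 / math.sqrt((1.0 + t_left ** 2) * (1.0 + t_right ** 2))
    return scale * (-a * t_right + d * t_left - b)

# Ordering suggested by the ratio: t3 from t1 through A1*A2, then t2 from t1.
t3_fwd = forward(mul(A1, A2), t1)
t2_fwd = forward(A1, t1)
fwd = [e_prime(A1, t1, t2_fwd), e_prime(A2, t2_fwd, t3_fwd), e_prime(A3, t3_fwd, t4)]

# The other ordering: t3 from t4 through A3, then t2 from t3 through A2.
t3_bwd = backward(A3, t4)
t2_bwd = backward(A2, t3_bwd)
bwd = [e_prime(A1, t1, t2_bwd), e_prime(A2, t2_bwd, t3_bwd), e_prime(A3, t3_bwd, t4)]

print("forward :", fwd)
print("backward:", bwd)
```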
1. Introduction

The problem of reordering eigenvalues of a matrix in real Schur form arises in the computation
of the invariant subspaces corresponding to a group of eigenvalues of the matrix. A
basic step in such reordering is to swap two neighboring $1 \times 1$ or $2 \times 2$ diagonal blocks by an
orthogonal transformation. Swapping two $1 \times 1$ blocks or swapping $1 \times 1$ and $2 \times 2$ blocks is
well understood [3]. Swapping two $2 \times 2$ blocks poses some numerical difficulties. Recently,
Bai and Demmel [1] have proposed an algorithm for swapping two $2 \times 2$ blocks which is for
all practical purposes backward stable. In this note we describe an alternative approach
for swapping two $2 \times 2$ blocks which is based on an eigenvector calculation. It appears that
the method guarantees small rounding errors in the (2,1) block of the transformed $4 \times 4$
matrix even if the two $2 \times 2$ blocks have almost the same eigenvalues.


2.  Reordering  eigenvalues 


Assume  that  A  is  a  4  x  4  block  triangular  matrix. 
where $A_{11}$ and $A_{22}$ are $2 \times 2$ with pairs of complex conjugate eigenvalues $\lambda_1$, $\bar{\lambda}_1$ and $\lambda_2$,
$\bar{\lambda}_2$. We can further assume that $A_{11}$ and $A_{22}$ are in the standard form,
We want to find an orthogonal transformation $Q$ such that

X‘), 

where $\hat{A}_{11}$ and $\hat{A}_{22}$ are similar to $A_{11}$ and $A_{22}$ respectively.

The standard form implies that $\lambda_2 = \alpha_2 + \beta_2 \cdot i$ is the eigenvalue of $A_{22}$. Thus $A(\lambda_2) =
A - \lambda_2 \cdot I$ is singular as its (2,2) diagonal block has rank 1. Now one can find a sequence
of complex Givens rotations such that
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where  denotes  a  complex  Givens  rotation  operating  in  the  plane  (i J)  introducing 
zero  at  the  position  marked  as  ( k )  on  the  right  hand  side  of  the  relation.  Let  G  = 

^34^12^23^12*;  Then  y  =  u  +  v-i  =  Gex,  where  n  =  (n, ,  u2, U3,  u.|]T  and  v  =  (tq,n2,t'3,n4]r 
are  real  vectors,  is  the  complex  eigenvector  corresponding  to  A2.  Hence 
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Moreover,  because  A22  is  assumed  to  be  in  a  standard  form,  u4  =  t'3  =  0.  The  similar¬ 
ity  transformation  Q  can  be  expressed  now  as  a  product  of  real  Givens  rotations  which 

triangularizes  the  matrix  [u  u].  More  precisely,  let  Q  =  ke  such  that 
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where  jj^  denotes  the  corresponding  rotation.  Then  Q  is  the  desired  similarity  transfor¬ 
mation. 


Numerous numerical tests suggest that in the presence of rounding errors the relative error
in the (2,1) block of the transformed matrix $A$ is proportional to the machine precision.
The algorithm can be extended to cover the case of swapping diagonal blocks in the periodic
Schur form [2].

Abstract.

In this paper we derive a unitary eigendecomposition for a sequence of matrices which we call the periodic
Schur decomposition. We prove its existence and discuss its application to the solution of periodic difference
equations arising in control. We show how the classical QR algorithm can be extended to provide a
stable algorithm for computing this generalized decomposition. We apply the decomposition also to cyclic
matrices and two point boundary value problems.

Key words. Numerical algorithms, linear algebra, periodic systems, K-cyclic matrices, two-point
boundary value problems

1  Introduction 

In the study of time-varying control systems in (generalized) state space form:

$$\begin{cases} E_k z_{k+1} = F_k z_k + G_k u_k \\ y_k = H_k z_k + J_k u_k \end{cases} \qquad (1)$$

the periodic coefficients case has always been considered the simplest extension of the time-invariant case.
Here the coefficients satisfy, for some $K > 0$, the periodicity conditions $E_k = E_{k+K}$, $F_k = F_{k+K}$, $G_k =
G_{k+K}$, $H_k = H_{k+K}$, $J_k = J_{k+K}$. In the last few years there has been a renewed interest in the area because
such systems arise naturally in multi-rate sampling of continuous time systems [1]. Several papers were
devoted to the algebraic structure of periodic discrete time systems and it appears that a lot of the algebra
indeed carries over from the time-invariant case [9]. For period $K = 1$ one has the time invariant case
$E_k = E$, $F_k = F$, $G_k = G$, $H_k = H$, $J_k = J$, and it is well-known that the generalized eigenvalues of
particular pencils derived from these matrices then determine the behaviour of these difference equations
[13]. In the case $K > 1$ one can derive a set of $K$ time-invariant subsampled systems [2], [9] that describe
the behaviour of the periodic system. Problems of pole placement, optimal control and robust control can
then be solved via these $K$ subsampled systems.
During the last few decades linear algebra has played an important role in advances being made in
the area of systems and control [16]. The most profound impact has been in the computational and
implementational aspects, where numerical linear algebraic algorithms have strongly influenced the ways in
which problems are being solved. The most reliable numerical linear algebra methods proposed for particular
control problems are related to particular eigenvalue and singular value decompositions of "special"
matrices, such as special Schur decompositions for solving Riccati equations [10], [14]. Here we present
a new decomposition called the periodic Schur form that has important applications in control theoretic
problems of periodic systems. We present a few of these applications and predict that several other uses
will be found.

The decomposition also has a direct application to $K$-cyclic matrices and pencils, which occur in the
study of Markov chains and the solution of two point boundary value problems. We show how the periodic
Schur form naturally decomposes the underlying $n \times n$ matrix problem into $n$ scalar problems with the same
structure. This can then directly be used for the solution of Markov chains and two point boundary value
problems in an elegant manner. The relation with $K$-cyclic pencils also allows us to completely characterize
the singular matrix case and give conditions for the existence of solutions in the singular case.

2  Periodic  Schur  decomposition 

Consider the set of (homogeneous) difference equations

$$B_i \cdot x_{i+1} = A_i \cdot x_i , \quad i = 1, \ldots \qquad (2)$$

with periodic coefficients $A_i = A_{i+K}$, $B_i = B_{i+K}$. For period $K = 1$ one has the constant coefficient case
$A_i = A$, $B_i = B$ and it is well-known that the generalized eigenvalues of the pair $A$, $B$ yield important
information about the system (2). When $K > 1$ one derives from (2) a set of $K$ time invariant systems
which describe completely the behavior of (2). For simplicity we first assume all $B_i$ to be invertible. Then
define the matrices $S_i = B_i^{-1} A_i$ yielding the system:

$$x_{i+1} = B_i^{-1} A_i \cdot x_i = S_i \cdot x_i , \quad i = 1, \ldots \qquad (3)$$

which is an explicit system of difference equations in $x_i$, again with periodic coefficients $S_i = S_{i+K}$.

One can now consider subsampled systems which describe the evolution of (3) over $K$ steps, and since the
coefficient matrices of (3) are $K$-periodic, one may expect these subsampled systems to be time invariant.
Indeed, defining the matrices

$$S^{(k)} = S_{k+K-1} \cdots S_{k+1} \cdot S_k , \quad k = 1, \ldots, K \qquad (4)$$

then one obtains from (3), (4) the set of $K$ subsampled systems:

$$\begin{aligned}
x_{1+(i+1)K} &= S^{(1)} \cdot x_{1+iK} , & i &= 0, 1, 2, \ldots \\
x_{2+(i+1)K} &= S^{(2)} \cdot x_{2+iK} , & i &= 0, 1, 2, \ldots \\
&\;\;\vdots \\
x_{K+(i+1)K} &= S^{(K)} \cdot x_{K+iK} , & i &= 0, 1, 2, \ldots
\end{aligned} \qquad (5)$$

One easily checks that the above set of difference equations, initialized with the vectors $x_i$, $i = 1, \ldots, K$,
yields the same solution as (3). In order to describe the behaviour of these systems one thus requires the
eigenvalues and eigenvectors of the periodic matrix products $S^{(k)}$. It is known from similar decompositions
[11], [4], that explicitly forming the matrices $S^{(k)}$ ought to be avoided if possible. An implicit decomposition
of these matrices is now obtained in the following theorem.
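The equivalence between the periodic recurrence (3) and the subsampled systems can be checked on a small example. The sketch below uses integer matrices so that all arithmetic is exact; the matrices, the starting vector and the helper names are assumptions made for illustration. It forms the products $S^{(k)}$ explicitly, which is done here only to check the statement and is exactly what the text recommends avoiding in a numerical algorithm.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(X, v):
    return [sum(X[i][k] * v[k] for k in range(len(v))) for i in range(len(X))]

def periodic_product(S, k):
    """S^(k) = S_{k+K-1} ... S_{k+1} S_k, with 1-based k and S_i = S_{i+K}."""
    K = len(S)
    P = S[(k - 1) % K]
    for j in range(k, k + K - 1):       # list index j % K holds S_{j+1}
        P = matmul(S[j % K], P)
    return P

# Assumed periodic coefficients S_1, S_2, S_3 (period K = 3) and start vector x_1.
S = [[[1, 2], [0, 1]], [[2, 0], [1, 1]], [[0, 1], [-1, 3]]]
K = len(S)
x1 = [1, -1]

# Trajectory of (3): traj[i] holds x_{i+1}.
traj = [x1]
for i in range(2 * K):
    traj.append(matvec(S[i % K], traj[i]))

# Each subsampled system advances its own subsequence by K steps at a time.
checks = [traj[k - 1 + K] == matvec(periodic_product(S, k), traj[k - 1])
          for k in range(1, K + 1)]
print(checks)
```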
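Theorem 1 Let the matrices $A_i$, $B_i$, $i = 1, \ldots, K$ be all $n \times n$ and complex. Then there exist unitary
matrices $Q_i$, $Z_i$, $i = 1, \ldots, K$ such that:

$$\begin{array}{ll}
\hat{B}_1 = Z_1^* \cdot B_1 \cdot Q_2 , & \hat{A}_1 = Z_1^* \cdot A_1 \cdot Q_1 , \\
\hat{B}_2 = Z_2^* \cdot B_2 \cdot Q_3 , & \hat{A}_2 = Z_2^* \cdot A_2 \cdot Q_2 , \\
\quad \vdots & \quad \vdots \\
\hat{B}_{K-1} = Z_{K-1}^* \cdot B_{K-1} \cdot Q_K , & \hat{A}_{K-1} = Z_{K-1}^* \cdot A_{K-1} \cdot Q_{K-1} , \\
\hat{B}_K = Z_K^* \cdot B_K \cdot Q_1 , & \hat{A}_K = Z_K^* \cdot A_K \cdot Q_K
\end{array} \qquad (6)$$

where now all matrices $\hat{B}_i$, $\hat{A}_i$ are upper triangular. Moreover if the matrices $B_i$ are invertible then each
$Q_i$ puts the matrix $S^{(i)}$ in upper Schur form, i.e. $Q_i^* S^{(i)} Q_i$ is upper triangular.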

Proof: Because of its simplicity and constructive derivation, we give here a simple proof assuming all
matrices $A_i$ and $B_i$ are non-singular, except possibly $A_1$. The more complex case of singular matrices is
proven in section 3.2.

If all matrices $B_i$ are invertible then all matrices $S^{(k)}$ exist. Compute the upper Schur form of $S^{(1)}$:

$$\hat{S}^{(1)} = Q_1^* S^{(1)} Q_1 .$$

This defines the matrix $Q_1$ and one can thus consider the matrix $B_K \cdot Q_1$ and its QR decomposition:

$$Z_K \cdot \hat{B}_K = [B_K Q_1]$$

which defines the unitary factor $Z_K$ and upper-triangular factor $\hat{B}_K$. In turn, one then considers the matrix
$Z_K^* \cdot A_K$ and its RQ decomposition (i.e. dual to the QR decomposition):

$$\hat{A}_K Q_K^* = [Z_K^* A_K]$$

which defines the unitary factor $Q_K$ and upper-triangular factor $\hat{A}_K$. Repeating this for all subsequent
matrices defines:

• $Z_i$ and $\hat{B}_i$ from the QR factorization of $B_i \cdot Q_{i+1}$ for $i = K, \ldots, 1$ and

• $Q_i$ and $\hat{A}_i$ from the RQ factorization of $Z_i^* \cdot A_i$ for $i = K, \ldots, 2$.

Notice that each of these decompositions in fact corresponds to one of the equations in (6), starting from
bottom to top. By now all transformation matrices $Q_i$ and $Z_i$ are defined but we have not proved that the
last matrix $\hat{A}_1$ is upper-triangular, since in the equation

$$\hat{A}_1 = Z_1^* \cdot A_1 \cdot Q_1$$

the matrix $Q_1$ was already defined. But consider now the product

$$Q_1^* S^{(1)} Q_1 = [Q_1^* B_K^{-1} Z_K][Z_K^* A_K Q_K] \cdots [Q_3^* B_2^{-1} Z_2][Z_2^* A_2 Q_2][Q_2^* B_1^{-1} Z_1][Z_1^* A_1 Q_1] \qquad (7)$$

or

$$\hat{S}^{(1)} = \hat{B}_K^{-1} \hat{A}_K \cdots \hat{B}_2^{-1} \hat{A}_2 \hat{B}_1^{-1} (Z_1^* A_1 Q_1) . \qquad (8)$$

Now since all "hat" matrices in both sides of equation (8) are upper-triangular and invertible, this must
also hold for the matrix $\hat{A}_1 = Z_1^* A_1 Q_1$. This completes the constructive proof of the existence of (6).

Notice that the proof shows how to derive all matrices $Q_i$ and $Z_i$ from just one of them. Moreover, by
periodically interchanging the products in (7) one easily sees that also

$$Q_i^* S^{(i)} Q_i = \hat{S}^{(i)} = \hat{B}_{i-1}^{-1} \hat{A}_{i-1} \cdots \hat{B}_1^{-1} \hat{A}_1 \hat{B}_K^{-1} \hat{A}_K \cdots \hat{B}_{i+1}^{-1} \hat{A}_{i+1} \hat{B}_i^{-1} \hat{A}_i \qquad (9)$$

is upper triangular and hence a Schur decomposition. So all Schur forms are actually dependent on one
another via (6). ■

Corollary 1 Let the matrices $A_i$, $B_i$, $i = 1, \ldots, K$ be all $n \times n$ and real. Then there exist orthogonal
matrices $Q_i$, $Z_i$, $i = 1, \ldots, K$ such that the above decomposition (6) holds and all but one of the matrices
$\hat{B}_i$, $\hat{A}_i$ are upper triangular. This last one is in quasi-upper triangular form with $1 \times 1$ and $2 \times 2$ diagonal
blocks.

Proof: Assume that all matrices are invertible except, say, $A_1$ (see section 3.2 for the general case). The
proof then goes as before. Pick a real transformation $Q_1$ that puts $S^{(1)}$ in real Schur form $\hat{S}^{(1)} = Q_1^T S^{(1)} Q_1$.
Then perform all QR factorizations as above to define the remaining transformation matrices $Z_i$, $i = K, \ldots, 1$
and $Q_i$, $i = K, \ldots, 2$ in decreasing order (these are real transformations, of course). In (8) $\hat{B}_i$, $i = K, \ldots, 1$,
$\hat{A}_i$, $i = K, \ldots, 2$ (and their inverses) are upper triangular, and $\hat{S}^{(1)}$ is quasi upper-triangular. From this
it follows that $\hat{A}_1$ must be of the same form as $\hat{S}^{(1)}$. If one would have started the definition of the
transformations $Z_i$ and $Q_i$ from the other side (i.e. the QR factorization of $A_1 Q_1$ instead of $B_K Q_1$) then
$\hat{B}_K$ (and its inverse) would have the same form as $\hat{S}^{(1)}$. Finally, by starting the above reasoning with a
different index $i$ it is clear that one can pick any matrix $\hat{A}_i$ or $\hat{B}_i$ to have the quasi-triangular shape. It is
easy to move it around as well via a "post-processing" using updating Givens rotations. ■

In fact the matrices $Q_i$ transform the vectors $x_i$ to $\hat{x}_i = Q_i^* \cdot x_i$ and the difference equations (2) to the
equivalent system:

$$Z_i^* B_i Q_{i+1} \cdot Q_{i+1}^* x_{i+1} = Z_i^* A_i Q_i \cdot Q_i^* x_i , \quad i = 1, \ldots \qquad (10)$$

or

$$\hat{B}_i \cdot \hat{x}_{i+1} = \hat{A}_i \cdot \hat{x}_i , \quad i = 1, \ldots \qquad (11)$$

with periodic coefficients $\hat{A}_i = \hat{A}_{i+K}$, $\hat{B}_i = \hat{B}_{i+K}$ which are now all upper triangular (except one quasi
triangular one in the real case). The same transformations can of course be applied to the non-homogeneous
case, and this will be used later on.

An  elegant  consequence  of  the  above  theorem  is  the  following  corollary. 

Corollary 2 All periodic products $S^{(k)}$ have equal eigenvalues and their Schur forms $\hat{S}^{(k)}$ given by the
implicit decomposition (6) have the same eigenvalues on diagonal.

Proof: It is trivially seen that $S^{(i)}$ and $S^{(1)}$ have equal eigenvalues since

$$S^{(i)} = M_1 M_2 , \quad S^{(1)} = M_2 M_1$$

with

$$M_2 = S_K \cdots S_i , \quad M_1 = S_{i-1} \cdots S_1 .$$

Equality of spectrum indeed follows immediately from this. The Schur forms of the matrices $S^{(k)}$ will thus
have the same diagonal elements, up to their ordering. But the Schur forms constructed by (6) have the
additional property that the diagonal elements of the matrices $\hat{S}^{(k)}$ are all actually equal. Indeed, they are
the products of the diagonal elements of the upper triangular matrices $\hat{B}_i^{-1} \hat{A}_i$. So, if one matrix $\hat{S}^{(k)}$ has
a particular ordering of eigenvalues then all other matrices $\hat{S}^{(k)}$ have the same ordering of eigenvalues. ■
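The first claim of Corollary 2 is easy to observe on a small case. For $2 \times 2$ matrices the eigenvalues are determined by the trace and the determinant, so the sketch below compares these two invariants for all periodic products. The integer matrices and helper names are assumptions chosen so that the arithmetic is exact; the products are formed explicitly only for the purpose of this check.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def periodic_product(S, k):
    """S^(k) = S_{k+K-1} ... S_{k+1} S_k, with 1-based k and S_i = S_{i+K}."""
    K = len(S)
    P = S[(k - 1) % K]
    for j in range(k, k + K - 1):
        P = matmul(S[j % K], P)
    return P

def trace_and_det(M):
    """Coefficients of the characteristic polynomial of a 2x2 matrix."""
    return (M[0][0] + M[1][1], M[0][0] * M[1][1] - M[0][1] * M[1][0])

# Assumed periodic coefficients S_1, S_2, S_3 with integer entries.
S = [[[1, 2], [0, 1]], [[2, 0], [1, 1]], [[0, 1], [-1, 3]]]
products = [periodic_product(S, k) for k in range(1, len(S) + 1)]
invariants = [trace_and_det(P) for P in products]
print(products)
print(invariants)
```

The products themselves differ from one another, while their characteristic polynomials coincide.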
We give in the next section an algorithm to compute the above decomposition implicitly, i.e. without
ever forming the products $S^{(k)}$. Moreover we show how to reorder the eigenvalues of these Schur forms.
We call this the periodic QR algorithm as related to the above periodic Schur decomposition.

3 Periodic QR algorithm

We now consider the computation of the periodic Schur decomposition. Here we will not require the
invertibility of the matrices $A_i$, $B_i$. In order to have a periodic QR algorithm we need the following
ingredients to make the algorithm work:

1. a reduction to some kind of Hessenberg form

2. a direct deflation of the singular case

3. a shift calculation procedure

4. a method for performing QR steps

5. a procedure for reordering eigenvalues.

In the above list one should try to do as much as possible implicitly, i.e. without ever constructing the
products $S^{(k)}$. Moreover one would like the total complexity of the algorithm to be comparable to the
cost of $K$ Schur decompositions, since this is what we implicitly compute. This means that the complexity
should be $O(Kn^3)$ for the whole process. Notice that this indeed precludes the construction of the products
$S^{(k)}$ since this would already require $O(K^2 n^3)$ operations. We now derive such implicit solutions for each
item. Below $\mathcal{H}(i, j)$ denotes the group of Householder transformations whereby $(i, j)$ is the range of
rows/columns they operate on. Similarly $\mathcal{G}(i, i+1)$ denotes the group of Givens transformations operating
on rows/columns $i$ and $i+1$.

3.1  Hessenberg-triangular  reduction 

We first consider the case where all $B_i$ are the identity. We thus only have a product of matrices $A_i$ and
in order to illustrate the procedure we show its evolution on a product of 3 matrices only, i.e. $A_3 A_2 A_1$.
Below is a sequence of "snapshots" of the evolution of the Hessenberg-triangular reduction. Each snapshot
indicates the pattern of zeros ('0') and nonzeros ('x') in the three matrices.

First perform a Householder transformation $Q_3 \in \mathcal{H}(1, n)$ on the rows of $A_2$ and the columns of $A_3$.
Choose $Q_3$ to annihilate all but one element in the first column of $A_2$:

Then perform a Householder transformation $Q_1 \in \mathcal{H}(1, n)$ on the rows of $A_3$ and the columns of $A_1$.
Choose $Q_1$ to annihilate all but one element in the first column of $A_3$:

Then perform a Householder transformation $Q_2 \in \mathcal{H}(2, n)$ on the rows of $A_1$ and the columns of $A_2$.
Choose $Q_2$ to annihilate all but two elements in the first column of $A_1$:

[ x x x x x x ]   [ x x x x x x ]   [ x x x x x x ]
[ 0 x x x x x ]   [ 0 x x x x x ]   [ x x x x x x ]
[ 0 x x x x x ]   [ 0 x x x x x ]   [ 0 x x x x x ]
[ 0 x x x x x ]   [ 0 x x x x x ]   [ 0 x x x x x ]
[ 0 x x x x x ]   [ 0 x x x x x ]   [ 0 x x x x x ]
[ 0 x x x x x ]   [ 0 x x x x x ]   [ 0 x x x x x ]

Notice  that  this  third  transformation  did  not  destroy  any  of  the  previously  created  elements  in  A 2  because 
it  did  not  transform  its  first  column.  A  similar  set  of  three  transformations  yields  the  following  three 
snapshots  : 


and  this  continues  until  we  reach  the  Hessenberg-triangular  form  : 


[ x x x x x x ]   [ x x x x x x ]   [ x x x x x x ]
[ 0 x x x x x ]   [ 0 x x x x x ]   [ x x x x x x ]
[ 0 0 x x x x ]   [ 0 0 x x x x ]   [ 0 x x x x x ]
[ 0 0 0 x x x ]   [ 0 0 0 x x x ]   [ 0 0 x x x x ]
[ 0 0 0 0 x x ]   [ 0 0 0 0 x x ]   [ 0 0 0 x x x ]
[ 0 0 0 0 0 x ]   [ 0 0 0 0 0 x ]   [ 0 0 0 0 x x ]


When the matrices $B_i$ are not the identity matrix, one starts with transforming each of them to
triangular form. Then one proceeds with a similar reduction procedure for the matrices $A_i$ as above.
While the zero elements are being created in the matrices $A_i$ one preserves the matrices $B_i$ in upper
triangular form at each step. Therefore, one cannot make use of Householder transformations anymore.
Indeed, applying a Householder transformation in $\mathcal{H}(k, n)$ (left or right) to a triangular matrix $B_i$ fills it in
and one cannot find a Householder transformation in the same class operating on the other side of $B_i$ that
will restore its triangular shape. On the other hand, this is easily done when using a Givens transformation
in $\mathcal{G}(k, k+1)$ since then only the element $B_i(k+1, k)$ fills in below the diagonal and this can immediately
be annihilated again using another Givens transformation in $\mathcal{G}(k, k+1)$ operating on the other side of
$B_i$. The above procedure of creating zeros in $A_i$, while maintaining the matrices $B_i$ in upper triangular
form, can thus go through. Notice that for the case $K = 1$ one retrieves exactly the Hessenberg-triangular
reduction of the QZ algorithm [11]. Operation counts for this Hessenberg-triangular reduction are given
in section 5.1.
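The fill-in argument for Givens transformations can be seen on a small example. The sketch below applies a rotation to rows $k$ and $k+1$ of an upper triangular matrix, observes the single new element below the diagonal, and removes it with a rotation acting on columns $k$ and $k+1$. The matrix, the rotation angle and the helper names are assumptions for illustration; indices in the code are 0-based.

```python
import math

def rotate_rows(M, k, c, s):
    """Left-multiply M by a Givens rotation acting on rows k and k+1."""
    for j in range(len(M[0])):
        p, q = M[k][j], M[k + 1][j]
        M[k][j], M[k + 1][j] = c * p + s * q, -s * p + c * q

def rotate_cols(M, k, c, s):
    """Right-multiply M by a Givens rotation acting on columns k and k+1."""
    for i in range(len(M)):
        p, q = M[i][k], M[i][k + 1]
        M[i][k], M[i][k + 1] = c * p + s * q, -s * p + c * q

# Assumed upper triangular test matrix.
B = [[4.0, 1.0, -2.0, 0.5],
     [0.0, 3.0, 1.0, 2.0],
     [0.0, 0.0, 2.0, -1.0],
     [0.0, 0.0, 0.0, 5.0]]
k = 1
theta = 0.6   # arbitrary angle; in the reduction it would be dictated by A_i

# A row rotation fills in exactly one element, B[k+1][k].
rotate_rows(B, k, math.cos(theta), math.sin(theta))
fill_in = B[k + 1][k]

# A column rotation in the same plane annihilates it again.
p, q = B[k + 1][k], B[k + 1][k + 1]
r = math.hypot(p, q)
rotate_cols(B, k, q / r, -p / r)

below_diagonal = max(abs(B[i][j]) for i in range(4) for j in range(i))
print(fill_in, below_diagonal)
```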

3.2  Direct  deflation  of  the  singular  case 

In  this  section  we  show  how  to  perform  direct  deflations  in  the  Hessenberg-triangular  form  when  either  of  the 
pivot  elements  is  zero.  With  pivot  element  we  mean  the  elements  on  the  diagonal  of  each  triangular  matrix 
At,  i  =  2,...,  A',  B{,  i  =  1,...,  A'  and  below  the  diagonal  in  the  Hessenberg  matrix  A\.  Below  we  treat 
three  different  cases  and  show  how  direct  deflations  can  be  performed  to  yield  one  or  several  subproblems 
of  smaller  dimensions  where  now  all  pivot  elements  are  nonzero.  This  corresponds  to  subproblems  without 
eigenvalues  at  zero  or  oc. 

Case 1. When an element below the diagonal of $A_1$ is zero, the problem trivially decomposes into two
lower dimensional problems, as shown below for matrices $B_2$, $A_2$, $B_1$, $A_1$ where the (4,3) element in $A_1$ is
zero :

This reduction is identical to what happens in the single matrix case and clearly can be repeated until one
obtains smaller dimensional matrices $A_1$ with non-zero subdiagonals (i.e. unreduced Hessenberg forms).
Moreover the reduction does not involve any transformation but only a partitioning. The next two cases
are zero diagonal elements in any of the remaining matrices. One first deflates the zeros in the first matrix
in the sequence $B_2, A_3, B_3, \dots, A_K, B_K$, i.e. one first treats the “closest” matrix to $A_1$.
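
Since Case 1 only partitions the matrices, it can be sketched in a few lines. The Python fragment below is an illustration under stated assumptions: the function name `unreduced_blocks` and the tolerance argument are hypothetical, indices are 0-based, and only the Hessenberg matrix $A_1$ is inspected (the other matrices of the sequence would be partitioned conformably).

```python
def unreduced_blocks(A1, tol=0.0):
    """Half-open index ranges (start, end) of the unreduced Hessenberg blocks of A1."""
    n = len(A1)
    blocks, start = [], 0
    for k in range(1, n):
        # A1[k][k-1] is the subdiagonal entry in row k.
        if abs(A1[k][k - 1]) <= tol:
            blocks.append((start, k))
            start = k
    blocks.append((start, n))
    return blocks

H = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [1.0, 2.0, 3.0, 4.0, 5.0],
     [0.0, 1.0, 2.0, 3.0, 4.0],
     [0.0, 0.0, 0.0, 2.0, 3.0],   # the (4,3) element (1-based) is zero
     [0.0, 0.0, 0.0, 1.0, 2.0]]
blocks = unreduced_blocks(H)      # [(0, 3), (3, 5)]
```

The two ranges correspond to the two lower dimensional problems mentioned in the text.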

Case 2. If the closest matrix to $A_1$ with zero diagonal elements is $A_i$, then the partial product
$A_i B_{i-1}^{-1} A_{i-1} \cdots B_1^{-1} A_1$ again decomposes into a block diagonal matrix, as indicated below with the sequence
$A_2 B_1^{-1} A_1$ where $A_2$ has a zero diagonal in position (4,4) :

Moreover the bottom block is rank 3 only and one ought to be able to extract a zero eigenvalue. We now
show how a sequence of Givens transformations can be generated to obtain a deflated and decomposed
form of the type :


We first apply the row transformation $Z_1 = G_3 \cdot G_2 \cdot G_1$ to $A_1$, where the Givens transformations $G_1 \in G(1,2)$,
$G_2 \in G(2,3)$ and $G_3 \in G(3,4)$ are chosen to annihilate the elements $0_1$, $0_2$ and $0_3$, respectively, as
given below. Propagating those through the intermediate triangular matrices (here only $B_1$), this results in
the column transformation $Q_2 = G_4 \cdot G_5 \cdot G_6$ applied to $A_2$, where the Givens transformations $G_4 \in G(1,2)$
and $G_5 \in G(2,3)$ respectively create the nonzero elements $x_4$ and $x_5$ ($G_6 \in G(3,4)$ does not create any
element) :


Then the two elements $x_4$ and $x_5$ are annihilated again by Givens transformations $G_7 \in G(1,2)$ and
$G_8 \in G(2,3)$ as part of the row transformation $Z_2 = G_8 \cdot G_7$ acting on $A_2$ (this yields $0_7$ and $0_8$, respectively).
Propagating these through the intermediate triangular matrices left of $A_2$ and then back to $A_1$, this results
in the column transformation $Q_1 = G_9 \cdot G_{10}$ acting on $A_1$. Here the Givens transformations $G_9 \in G(1,2)$
and $G_{10} \in G(2,3)$ create the elements $x_9$ and $x_{10}$, respectively :


This subsequence of matrices is now already closer to the desired result. The next steps are dual to the
ones above and are just indicated below by the sequence of annihilated and created elements. Just as
above, everything is done via appropriate Givens rotations :

and finally :

which is precisely the desired form. Notice that all this required about $n$ Givens rotations on each side
of each condensed matrix. As a result a zero eigenvalue was deflated and moreover a block reduction was
obtained at the same time (see section 5.1 for more details on the operation count).

Case 3. We now consider the case where the closest matrix with a zero diagonal element is a
matrix $B_i$. Without loss of generality we may assume that it is the matrix $B_1$, since we can always associate
the subproduct $A_i B_{i-1}^{-1} A_{i-1} \cdots B_1^{-1} A_1$ with the matrix $A_1$ (this subproduct indeed exists and is unreduced
Hessenberg). Below we thus take the example $\dots B_1, A_1$ where $B_1$ has a zero diagonal in position (4,4) :

We first perform a row transformation $Z_1 = G_1$ on both $B_1$ and $A_1$ where $G_1 \in G(4,5)$ is chosen to
annihilate the element $0_1$ in $B_1$. At the same time a nonzero element $x_1$ is created in $A_1$ :

Then a column transformation $Q_1 = G_2$ with $G_2 \in G(3,4)$ is applied to $A_1$ to annihilate the element $x_1$
again (yielding $0_2$). Propagating this over all triangular matrices back to $B_1$ yields a column transformation
$Q_2 \in G(3,4)$ that does not create any fill in :

After this step the $B_1$ matrix has two consecutive zero diagonal elements. The next pair of steps moves
these zero diagonals one element down while keeping $A_1$ Hessenberg. First apply a row transformation
$Z_1 = G_3$ on both $B_1$ and $A_1$ where $G_3 \in G(5,6)$ annihilates $0_3$ in $B_1$ and creates $x_3$ in $A_1$ :

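\[
\begin{bmatrix}
x & x & x & x & x & x \\
0 & x & x & x & x & x \\
0 & 0 & x & x & x & x \\
0 & 0 & 0 & 0 & x & x \\
0 & 0 & 0 & 0 & 0 & x \\
0 & 0 & 0 & 0 & 0 & 0_3
\end{bmatrix}
\quad
\begin{bmatrix}
x & x & x & x & x & x \\
x & x & x & x & x & x \\
0 & x & x & x & x & x \\
0 & 0 & x & x & x & x \\
0 & 0 & 0 & x & x & x \\
0 & 0 & 0 & x_3 & x & x
\end{bmatrix}
\]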

Then apply the column transformation $Q_1 = G_4$ with $G_4 \in G(4,5)$ on $A_1$ to annihilate the element $x_3$
again (yielding $0_4$). Propagating this over all triangular matrices back to $B_1$ yields a column transformation
$Q_2 \in G(4,5)$ that creates the element $x_4$ :

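\[
\begin{bmatrix}
x & x & x & x & x & x \\
0 & x & x & x & x & x \\
0 & 0 & x & x & x & x \\
0 & 0 & 0 & x_4 & x & x \\
0 & 0 & 0 & 0 & 0 & x \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\quad
\begin{bmatrix}
x & x & x & x & x & x \\
x & x & x & x & x & x \\
0 & x & x & x & x & x \\
0 & 0 & x & x & x & x \\
0 & 0 & 0 & x & x & x \\
0 & 0 & 0 & 0_4 & x & x
\end{bmatrix}
\]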

With the two consecutive zero diagonals now at the bottom of $B_1$, we finally apply a column transformation
$Q_1 = G_5$ with $G_5 \in G(5,6)$ on $A_1$ to annihilate its bottom off-diagonal element (yielding $0_5$). Propagating
this back to $B_1$ yields a column transformation $Q_2 \in G(5,6)$ that creates the element $x_5$ :

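\[
\begin{bmatrix}
x & x & x & x & x & x \\
0 & x & x & x & x & x \\
0 & 0 & x & x & x & x \\
0 & 0 & 0 & x & x & x \\
0 & 0 & 0 & 0 & x_5 & x \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\quad
\begin{bmatrix}
x & x & x & x & x & x \\
x & x & x & x & x & x \\
0 & x & x & x & x & x \\
0 & 0 & x & x & x & x \\
0 & 0 & 0 & x & x & x \\
0 & 0 & 0 & 0 & 0_5 & x
\end{bmatrix}
\]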

The above form can now be deflated as indicated above. Notice that again the number of Givens
transformations applied to each matrix is at most of the order of $n$ for one deflated eigenvalue at $\infty$.

Summary. The above three cases indicate that any zero pivot element can be deflated with $O(n)$
Givens transformations per matrix, until a (set of) lower dimensional problem(s) is obtained where now all
triangular matrices are invertible and $A_1$ is unreduced Hessenberg. In the proof of Theorem 1 and Corollary
1 the general case can thus be “pretreated” by the Hessenberg-triangular reduction followed by the direct
deflation described above. Theorem 1 and Corollary 1 can then be applied to these “nonsingular” cases,
which implicitly yields a proof of these theorems for the general case where any $B_i$ or $A_i$ may be singular.
Moreover, since the above procedure allows us to reduce the general problem to the nonsingular case, we
only need to consider this simpler case in the sequel.


3.3  Shift  calculation  and  QR  step  construction 

Since we now have a Hessenberg-triangular form with all lower order matrices invertible and unreduced, the
corresponding products $B_K^{-1} A_K \cdots B_2^{-1} A_2 B_1^{-1} A_1$ exist and are unreduced Hessenberg. In the QR algorithm
applied to an unreduced Hessenberg matrix, the shift is typically computed from the bottom 2 × 2 submatrix.
For the above sequence, this is of the form

Notice that the triangular 2 × 2 inverses can be replaced by their adjoints up to a scalar factor. The
eigenvalues of this 2 × 2 matrix are thus easily computed and are used for calculating the shift of the
QR step.
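
The remark about adjoints can be illustrated as follows. This is a hedged Python sketch, not the authors' code: the function names `mul2`, `adj_upper2` and `shift_candidates` are assumptions, the inputs are the trailing 2 × 2 blocks of the matrices (a full block for $A_1$, upper triangular blocks otherwise), and the scalar factor is the product of the determinants of the triangular $B_i$ blocks.

```python
import cmath

def mul2(X, Y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def adj_upper2(T):
    """Adjoint of an upper triangular 2 x 2 block: T * adj(T) = det(T) * I."""
    return [[T[1][1], -T[0][1]], [0.0, T[0][0]]]

def shift_candidates(A_blocks, B_blocks):
    """Eigenvalues of B_K^{-1} A_K ... B_1^{-1} A_1 for trailing 2 x 2 blocks,
    formed with adjoints and one scalar division instead of explicit inverses."""
    N = [[1.0, 0.0], [0.0, 1.0]]
    scale = 1.0
    for A, B in zip(A_blocks, B_blocks):       # blocks listed as (A_1, B_1), (A_2, B_2), ...
        N = mul2(adj_upper2(B), mul2(A, N))
        scale *= B[0][0] * B[1][1]             # det of the triangular block
    half_tr = 0.5 * (N[0][0] + N[1][1])
    det = N[0][0] * N[1][1] - N[0][1] * N[1][0]
    root = cmath.sqrt(half_tr * half_tr - det)
    return (half_tr + root) / scale, (half_tr - root) / scale

A1 = [[1.0, 2.0], [3.0, 4.0]]
B1 = [[2.0, 1.0], [0.0, 1.0]]
A2 = [[1.0, 1.0], [0.0, 2.0]]
B2 = [[1.0, 3.0], [0.0, 2.0]]
lam1, lam2 = shift_candidates([A1, A2], [B1, B2])
```

For these blocks the explicit product $B_2^{-1} A_2 B_1^{-1} A_1$ has trace -3 and determinant -1, which the two returned values reproduce.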


The transformation $Q_1$ of the QR step applied to the Hessenberg matrix

\[ B_K^{-1} A_K \cdots B_2^{-1} A_2 B_1^{-1} A_1 \]

is now completely defined by its first column. In the case of a single shift $\lambda$, this first column has only two
nonzero elements, corresponding to the normalized version of the 2-vector :

Since the matrices $Q_i$ and $Z_i$ are all defined by one another through the constraint that the updates to
$B_i$, $i = 1,\dots,K$ and $A_i$, $i = 2,\dots,K$ must be upper triangular, one could as well compute any other matrix
than $Q_1$. It turns out that the simplest one to construct is $Z_1$. It performs a QR step on the unreduced
Hessenberg matrix

\[ A_H = A_1 B_K^{-1} A_K \cdots B_2^{-1} A_2 B_1^{-1} \]

and is again defined by its first column, consisting of only two nonzero elements. Now this 2-vector is the
normalized version of :

which involves far fewer computations.


In the implicit double shift one determines the first column of the real matrix $(A_H - \lambda_1 I)(A_H - \lambda_2 I)$,
where $\lambda_1$ and $\lambda_2$ are the two eigenvalues of (12). In order to avoid complex arithmetic when $\lambda_i$, $i = 1,2$,
are complex conjugate one constructs the first column of $A_H^2 - s \cdot A_H + p \cdot I$ where $s = (\lambda_1 + \lambda_2)$ and
$p = \lambda_1 \cdot \lambda_2$ are real. This vector has only three nonzero elements and is, up to a constant,

\[ (A_H^2 - s\, A_H + p\, I)\, e_1 = [\, v_1, \; v_2, \; v_3, \; 0, \dots, 0 \,]^T , \]

where $v_1$, $v_2$ and $v_3$ are expressions in the leading elements $a^{(i)}_{j,k}$ and $b^{(i)}_{j,k}$ of the matrices $A_i$ and $B_i$.


3.4  Periodic  QR  step 

Again for simplicity we only consider the product of four matrices $B_2^{-1} A_2 B_1^{-1} A_1$ and the case of a single
shift in order to explain the general idea. The first three matrices are upper triangular. The last matrix
$A_1$ is upper Hessenberg.

Apply first $Z_1 \in G(1,2)$ to annihilate the bottom element in the 2-vector determined above. Applying
this to the rows of $B_1$ and $A_1$ yields :

Then construct the column transformation $Q_2 \in G(1,2)$ to annihilate again $x_1$ in $B_1$ but also apply this
transformation to the columns of $A_2$, creating $x_2$ :

Then apply the row transformation $Z_2 \in G(1,2)$ to $B_2$ and $A_2$ annihilating $x_2$ but creating $x_3$ :

Finally close the loop with the column transformation $Q_1 \in G(1,2)$ applied to $B_2$ and $A_1$ to annihilate
again $x_3$ but creating a “bulge” $x_4$ in $A_1$ :

Repeating this process chases the bulge one step down at each sequence of Givens transformations, until
it finally disappears at the bottom of the Hessenberg matrix $A_1$. Basically the same procedure applies to
the implicit double shift for real matrices except that then the bulge chasing transformations are 3 × 3
unitary matrices, realized by a product of Householder transformations or Givens transformations.

3.5  Reordering  eigenvalues 

We assume now that an upper triangular decomposition was obtained upon convergence of the above QR
steps (below there is only one 2 × 2 block in $A_1$). Then we want to permute the two (real) eigenvalues
corresponding to the diagonal elements $x_1$ and $x_2$ :

One then computes the product of the corresponding 2 × 2 matrices and computes from there the required
updating Givens transformations that will perform the swapping. Care has to be taken to implement this
in a numerically stable manner as was e.g. the case for the QZ reordering in [11]. This especially applies
to the swapping of two 2 × 2 blocks, which is a much more delicate problem.


4  Applications  of  the  periodic  Schur  form 


4.1  Periodic  control  systems 

The application of this decomposition to control theory is apparent. Periodic discrete time systems
naturally arise when performing multirate sampling of continuous time systems [1]. In optimal control of such
a periodic system one considers the problem :

\[ \text{Minimize } J = \sum_i \left( z_i^T Q_i z_i + u_i^T R_i u_i \right) \quad \text{subject to } E_i z_{i+1} = F_i z_i + G_i u_i \]

where the matrices $Q_i$, $R_i$, $E_i$, $F_i$, $G_i$ are periodic with period $K$. The Hamiltonian equations are
periodic homogeneous systems of difference equations (2) in the state $z_i$ and co-state $\lambda_i$ of the system. The
correspondences with (2) are :


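\[
x_i = \begin{bmatrix} \cdots \end{bmatrix}, \quad
B_i = \begin{bmatrix} -G_i R_i^{-1} G_i^T & \cdots \\ F_i^T & \cdots \end{bmatrix}, \quad
A_i = \begin{bmatrix} 0 & F_i \\ \cdots & \cdots \end{bmatrix}
\tag{14}
\]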


For finding the periodic solutions to the underlying periodic Riccati equation one has to find the stable
invariant subspaces of matrices $S^{(k)}$ as above, which happen to be symplectic in the discrete time case (one
has to assume here that $E_i$, $F_i$ and $R_i$ are invertible and eliminate implicitly $E_i$ [7]). Clearly the Schur
form is useful here as well as the reordering of eigenvalues [10], [14].


In pole placement of periodic systems [9], again the periodic Schur form and reordering is useful when
one wants to extend Varga’s pole placement algorithm [17] to periodic systems. Consider the system

\[ B_i z_{i+1} = A_i z_i + D_i u_i \tag{15} \]

with state feedback $u_i = F_i z_i + v_i$,

where the matrices $A_i$, $B_i$, $D_i$, $F_i$ are periodic with period $K$. This results in the closed loop system

\[ B_i z_{i+1} = (A_i + D_i F_i) z_i + D_i v_i \tag{16} \]

of which the underlying time invariant eigenvalues are those of the matrix :

\[ S_F^{(K)} = B_K^{-1}(A_K + D_K F_K) \cdots B_2^{-1}(A_2 + D_2 F_2)\, B_1^{-1}(A_1 + D_1 F_1). \tag{17} \]

In the above equation it is not apparent at all how to choose the matrices $F_i$ to assign particular eigenvalues
of $S_F^{(K)}$. Yet when the matrices $A_i$, $B_i$ are in the triangular form (6), one can choose the $F_i$ matrices to
have only nonzero elements in the last column. This will preserve the triangular form of the matrices
$A_i + D_i F_i$ and it is then trivial to choose e.g. one such column vector to assign one eigenvalue. In order
to assign the other eigenvalues one needs to reorder the diagonal elements in the periodic Schur form and
each time assign another eigenvalue with the same technique. This algorithm will of course fail when the
periodic system is not controllable, but this very procedure can in fact be adapted to precisely construct
the controllable subspace of the periodic system.
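
The last-column argument can be tried on a toy case. The sketch below is only an illustration under several assumptions that are not in the report: period $K = 2$, a single input (each $D_i$ is a column), feedback applied in the first factor only ($F_2 = 0$), and the hypothetical helper names `matmul`, `add`, `is_upper` and `last_column_feedback`.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def is_upper(M, tol=0.0):
    return all(abs(M[i][j]) <= tol for i in range(len(M)) for j in range(i))

def last_column_feedback(A1, B1, D1, rest, target):
    """Row vector F1 = [0, ..., 0, f] that moves the last eigenvalue of the
    product to `target`; `rest` is the product of a_nn / b_nn over the other
    factors, whose feedback is taken to be zero in this sketch."""
    n = len(A1)
    f = (target * B1[n - 1][n - 1] / rest - A1[n - 1][n - 1]) / D1[n - 1][0]
    return [[0.0] * (n - 1) + [f]]

A1 = [[1.0, 2.0, 3.0], [0.0, 2.0, 1.0], [0.0, 0.0, 3.0]]
B1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
D1 = [[1.0], [1.0], [1.0]]
A2 = [[2.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
B2 = [[1.0, 1.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]

F1 = last_column_feedback(A1, B1, D1, rest=A2[2][2] / B2[2][2], target=0.5)
closed1 = add(A1, matmul(D1, F1))            # A_1 + D_1 F_1, still upper triangular
eigs = [closed1[j][j] / B1[j][j] * A2[j][j] / B2[j][j] for j in range(3)]
```

Only the last diagonal ratio changes; assigning the remaining eigenvalues would require the reordering step described in the text.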


4.2  K-cyclic  matrix  problems 

Here we consider the following pencils of matrices :

where the matrices $S_i = B_i^{-1} A_i$ are as defined earlier. The matrix $S$ is now known as a $K$-cyclic matrix,
and by extension we will call $\lambda B - A$ a $K$-cyclic pencil. It is well-known that the eigenvalues of $S$ are the
$K$-th roots of those of the matrix $S^K$, but the latter is easily checked to be block diagonal :

\[ S^K = \operatorname{diag}\{ S^{(1)}, S^{(2)}, \dots, S^{(K)} \} \]

where again the matrices $S^{(i)}$ are as defined earlier. This shows the relation between the two problems. We
now show that the decomposition (6) actually yields a block Schur decomposition of the above pencil as well.
Indeed the orthogonal transformations $Z = \operatorname{diag}\{Z_K, Z_1, \dots, Z_{K-1}\}$ and $Q = \operatorname{diag}\{Q_1, \dots, Q_{K-1}, Q_K\}$
yield a pencil $Z^* \cdot (\lambda B - A) \cdot Q$ which after appropriate reordering becomes upper block triangular with on
diagonal pencils of the type :

where the superscript $(j)$ indicates that the element belongs to the triangular matrices $A_j$ or $B_j$. For this reason the pencil
$\lambda B - A$ is nonsingular iff $\cdots \neq 0$, i.e. iff there are no zero by zero divides in two consecutive elements
(in a periodic sense) on the diagonals of the decomposition (6).
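
The relation between $S$ and $S^K$ is easy to see when every block is a scalar. The following Python sketch makes that simplifying assumption; the layout chosen for the cyclic matrix (entries $s_1, \dots, s_{K-1}$ on the subdiagonal and $s_K$ in the top right corner) and the helper names are assumptions, since the original display is not available here.

```python
import cmath
import math

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cyclic_matrix(s):
    """Scalar K-cyclic matrix: s[i] sits at position (i+1, i), indices mod K."""
    K = len(s)
    S = [[0.0] * K for _ in range(K)]
    for i in range(K):
        S[(i + 1) % K][i] = s[i]
    return S

def residual(S, s, lam):
    """max |S v - lam v| for the vector v with v_0 = 1, v_{i+1} = s_i v_i / lam."""
    K = len(s)
    v = [1.0 + 0.0j]
    for i in range(K - 1):
        v.append(s[i] * v[i] / lam)
    Sv = [sum(S[i][j] * v[j] for j in range(K)) for i in range(K)]
    return max(abs(Sv[i] - lam * v[i]) for i in range(K))

s = [2.0, 3.0, 4.0]
K = len(s)
S = cyclic_matrix(s)
SK = S
for _ in range(K - 1):
    SK = matmul(SK, S)                     # S^K, here 24 times the identity
p = math.prod(s)
roots = [p ** (1.0 / K) * cmath.exp(2j * math.pi * k / K) for k in range(K)]
worst = max(residual(S, s, lam) for lam in roots)
```

Here $S^K$ is diagonal with the product of the $s_i$ on its diagonal, and each of the $K$ complex $K$-th roots of that product is an eigenvalue of $S$.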

4.3  Two  point  boundary  value  problems 

In the solution of two point boundary value problems (not necessarily periodic), one encounters inversions of
matrices of cyclic type $(B + A)x = u$ where $A$ and $B$ are as above (18). Again we can apply the orthogonal
transformations $Z^*$ and $Q$ to obtain the system of equations $Z^*(B + A)Q(Q^* x) = Z^* u$ which essentially
decomposes into $n$ scalar TPBV problems. The big advantage of this is that increasing and decreasing
solutions in the TPBV problem have been decoupled. The periodic Schur form in fact “aligns” stable and
unstable solutions at each step. The decomposition could also be computed on a coarse mesh and then
“extrapolated” to finer meshes in order to avoid too much work. This is still under investigation.

5  Numerical  aspects 

The use of Householder and Givens transformations for all operations in the periodic QR algorithm
guarantees that the obtained matrices $\hat{A}_i$ and $\hat{B}_i$ in fact correspond to slightly perturbed data as follows (indices
are taken modulo $K$) :

\[ \hat{A}_i = \bar{Z}_i^* (A_i + \delta A_i) \bar{Q}_i , \qquad \hat{B}_i = \bar{Z}_i^* (B_i + \delta B_i) \bar{Q}_{i+1} , \]

where $\bar{Q}_i$ and $\bar{Z}_i$ are exactly unitary matrices and where $\|Q_i - \bar{Q}_i\|$, $\|Z_i - \bar{Z}_i\|$, $\|\delta A_i\| / \|A_i\|$ and $\|\delta B_i\| / \|B_i\|$
are all of the order of the machine precision $\epsilon$. This is obvious for the Hessenberg-triangular reduction and
the direct deflation since each element transformed to zero can indeed be put equal to zero without affecting
the $\epsilon$ bound (see [18], [8]). Things are different with the QR steps, since there one puts off-diagonal elements
in $A_1$ equal to zero only when these elements have converged to sufficiently small values. Convergence
of the QR process is thus needed to guarantee stability as well. Finally, for the reordering one needs to
prove that the swapping transformations indeed result in strictly upper triangular matrices with reversed
order of eigenvalues. This is the subject of another report.

6  Concluding  remarks 

The above decomposition clearly has many applications and we expect that additional ones will be found in
the future (e.g. in robust control of periodic systems). The above decomposition is also related to [4], which
computes the Jordan chains of sequences as considered here. This generalized QR decomposition in fact
plays the role of the rank determination (via QR or SVD) needed to reconstruct the Jordan/Kronecker
structure of pencils of the type (18). This could be used as a preprocessing step to eliminate the chains at
$\lambda = 0$ or $\lambda = \infty$ and extract in this manner a set of smaller but invertible matrices $A_i$, $B_i$, as was also
done in section 3.2 via direct deflation. The advantage of this new approach is that it also identifies the
structural indices at these two eigenvalues. Moreover, the generalized QR decomposition allows for
non-square matrices as well, and one can thus consider systems of the type (2) with $m \times n$ matrices $A_i$ and
$B_i$.

Similar unpublished ideas are being pursued by John Hench, UC Santa Barbara (personal
communication), who arrives at the same decomposition (6) with a different algorithm. His condensed form essentially
consists of all $A_i$ matrices in Hessenberg form and all $B_i$ matrices in triangular form. We feel that the
connection with the QR algorithm then fails to go through, although he reports a good convergence of that
algorithm as well. Possible applications to periodic continuous control systems are also being considered by
him.

The present report is a more extended version of the paper [3] presented at the SPIE conference held
in San Diego in July 1992.

Acknowledgements 

Part of this research was performed while the authors were visiting the Institute of Mathematics and
Applications of the University of Minnesota, Minneapolis, during the summer quarter of the Applied Linear
Algebra Year organized there. We greatly appreciated the hospitality and the productive atmosphere of that
institute. A. Bojanczyk was partially supported by the Joint Services Electronics Program (Grant F49620-90-C-0039,
monitored by AFOSR). G. Golub was partially supported by the Army Research Office (Grant DAAL03-90-G-0105)
and ARGOSystems (Grant 59613 Dept. Air Force). P. Van Dooren was partially supported
by the Research Board of the University of Illinois at Urbana-Champaign (Grant P 1-2-68114) and by the
National Science Foundation (Grant CCR 9209349).

References

[1] B. Francis, T. Georgiou, Stability theory for linear time-invariant plants with periodic digital
controllers, IEEE Trans. Aut. Contr. 33 (1988) 820-832.

[2] S. Bittanti, P. Colaneri, G. de Nicolao, The difference periodic Riccati equation for the periodic
prediction problem, IEEE Trans. Aut. Contr. 33 (1988) 706-712.

[3] A. Bojanczyk, G. Golub, P. Van Dooren, The periodic Schur form. Algorithms and Applications, SPIE
Conference, San Diego, July 1992.

[4] B. De Moor, P. Van Dooren, Generalizations of the singular value and QR decomposition, SIAM Matr.
Anal. & Applic. 13 (1992).

[5] J. Doyle, B. Francis and A. Tannenbaum, Feedback Control Theory, Macmillan, 1992.

[6] D. Flamm, A new shift-invariant representation for periodic linear systems, Proceedings American
Control Conference, May 1990, San Diego CA, 1510-1515.

[7] J. Gardiner, A. Laub, A generalization of the matrix-sign-function solution to the algebraic Riccati
equations, Int. Journal Control 44 (1986) 823-832.

[8] G. Golub, C. Van Loan, Matrix Computations, 2nd edition, The Johns Hopkins University Press,
Baltimore, Maryland, 1989.

[9] O. Grasselli, S. Longhi, The geometric approach for linear periodic discrete-time systems, Lin. Algebra
& Applic. 158 (1991) 27-60.




[10] A. Laub, Invariant subspace methods for the numerical solution of Riccati equations, in The Riccati
Equation, Eds. Bittanti, Laub, Willems, Springer Verlag, 1990.

[11] C. Moler, G. Stewart, An algorithm for the generalized matrix eigenvalue problem, SIAM J. Numer.
Anal. 10 (1973) 241-256.

[12] A. Sage, C. White, Optimum Systems Control, 2nd Ed., Prentice-Hall, New Jersey, 1977.

[13] P. Van Dooren, The generalized eigenstructure problem in linear system theory, IEEE Trans. Aut.
Contr. 26 (1981) 111-129.

[14] P. Van Dooren, A generalized eigenvalue approach for solving Riccati equations, SIAM J. Sci. Stat.
Comp. 2 (1981) 121-135.

[15] P. Van Dooren, M. Verhaegen, On the use of unitary state-space transformations, in Special Issue of
Contemporary Mathematics on Linear Algebra and its Role in Linear System Theory, AMS, 1985.

[16] P. Van Dooren, Numerical aspects of system and control algorithms, Journal A 30 (1989) 25-32.

[17] A. Varga, A Schur method for pole assignment, IEEE Trans. Aut. Contr. 26 (1981) 517-519.

[18] J. H. Wilkinson, The algebraic eigenvalue problem, Clarendon Press, Oxford, 1965.




TASK  6 


FAULT  TOLERANT  BEAMFORMING  ALGORITHMS 
F.  T.  Luk 


Computing  the  PSVD  of  two  2  x  2  triangular  matrices 

Gary E. Adams and Adam W. Bojanczyk
School of Electrical Engineering
Cornell University
Ithaca, NY 14853, USA

Franklin T. Luk

Department of Computer Science
Rensselaer Polytechnic Institute
Troy, NY 12180, USA


Abstract 

In this paper, we propose a method for computing the SVD of a product of two 2 × 2
triangular matrices. We show that our method is numerically desirable in that all relevant
residual elements will be numerically small.


1.  Introduction 

The problem of computing the singular value decomposition (SVD) of a product of two matrices has
many applications; see, e.g., [4] and [5]. The problem is also closely related to finding a generalized
SVD of two matrices (cf. [6]). A crucial step in either the product SVD (PSVD) or the generalized
SVD (GSVD) problem is the accurate computation of the PSVD of two 2 × 2 triangular matrices.

We wish to achieve two objectives: first, to ensure that the transformations applied to the
triangular matrices leave the matrices triangular and, second, to ensure that the SVD of the
product is computed accurately. As discussed in a recent paper by Bai and Demmel [1], these
two properties are essential to guarantee the stability of the GSVD method [6]. Several strategies
have been proposed to preserve these two properties. In [1], examples are presented where these
strategies can fail and a new method that overcomes the exposed drawbacks is then proposed.

In this paper we propose an alternative approach. Our new method, which we will call a
half-recursive method, is a slight variation of the fully-recursive method proposed in [2] for computing
the SVD of a product of several matrices. We show that our algorithm is simpler to implement
and enjoys the same nice numerical properties as the method in [1].

Our paper is organized as follows. In Section 2 we describe the PSVD of two 2 × 2 upper
triangular matrices. A criterion for numerical stability is given in Section 3. We present our new
algorithm in Section 4, and an error analysis in Section 5. Finally, some detailed proofs can be
found in Appendices A and B, and a numerical example in Appendix C.
2. Problem Definition

Given two upper triangular matrices:

\[ A_1 = \begin{pmatrix} a_1 & b_1 \\ 0 & d_1 \end{pmatrix} \quad \text{and} \quad A_2 = \begin{pmatrix} a_2 & b_2 \\ 0 & d_2 \end{pmatrix} , \]

we call the product $A$:

\[ A = A_1 A_2 , \]

and let

\[ A = \begin{pmatrix} a & b \\ 0 & d \end{pmatrix} . \]

Our objective is to find three orthogonal matrices $Q_1$, $Q_2$, $Q_3$ such that

\[ A' = Q_1 A Q_3^T = \begin{pmatrix} a' & 0 \\ 0 & d' \end{pmatrix} \tag{2.1} \]

and

\[ A_i' = Q_i A_i Q_{i+1}^T = \begin{pmatrix} a_i' & b_i' \\ 0 & d_i' \end{pmatrix} \tag{2.2} \]

for $i = 1,2$. The two equations (2.1) and (2.2) imply that

\[ a_1' b_2' = -b_1' d_2' . \]

In other words, we would like to find three transformations $Q_1$, $Q_2$ and $Q_3$ to zero out four elements;
namely, the off-diagonal elements of $A$ and the sub-diagonal elements of $A_1$ and $A_2$. This extra
requirement, although mathematically feasible, may cause numerical difficulty if not treated with
care; see examples in [1] and [2]. Our goal is to develop an algorithm so that properties (2.1) and
(2.2) will be satisfied except for very small numerical errors. In this paper, we use the vector and
matrix 2-norms:


2.1.  Relationship  with  GSVD 

The basic step in a GSVD of two 2 × 2 triangular matrices $A_1$ and $A_2$ is to compute the SVD of
the product $A_1 \cdot \operatorname{adj}(A_2)$, where adj denotes the adjoint of a matrix. We have

\[ \operatorname{adj}(A_2) = \begin{pmatrix} d_2 & -b_2 \\ 0 & a_2 \end{pmatrix} . \]

It is therefore obvious that our two-by-two PSVD method can also be applied to the two-by-two
GSVD problem.

3.  Criterion  for  Numerical  Stability 

Recall that $A_1'$, $A_2'$ and $A'$ denote the three matrices $A_1$, $A_2$ and $A$, respectively, after the equivalence
transformations as defined in (2.1) and (2.2) have been performed. Let $\bar{A}$ be the computed $A$, and
let $\bar{Q}_1$, $\bar{Q}_2$, and $\bar{Q}_3$ be the computed transformations. Define

\[ \bar{A}' := \bar{Q}_1 \bar{A} \bar{Q}_3^T = \begin{pmatrix} \bar{a}' & \bar{b}' \\ \bar{e}' & \bar{d}' \end{pmatrix} \tag{3.1} \]

and

\[ \bar{A}_i' := \bar{Q}_i A_i \bar{Q}_{i+1}^T = \begin{pmatrix} \bar{a}_i' & \bar{b}_i' \\ \bar{e}_i' & \bar{d}_i' \end{pmatrix} , \quad i = 1, 2 . \tag{3.2} \]

Let $\epsilon$ denote the relative machine precision. The best that we can aim for is to compute $\bar{Q}_i$ such
that

\[ \| \bar{Q}_i - Q_i \| = O(\epsilon) . \tag{3.3} \]

The relation (3.3) implies that the (2,1) element $\bar{e}_i'$ of $\bar{A}_i'$ will satisfy

\[ | \bar{e}_i' | = O(\epsilon \, \| A_i \|) \tag{3.4} \]

for $i = 1,2$. Condition (3.4) implies that $\bar{e}_i'$ may be safely truncated to zero. Thus, $\bar{e}'$ is also forced
to zero.

We prove in Section 5 that by using our new method, the computed matrices $\bar{A}_1'$ and $\bar{A}_2'$ will
satisfy condition (3.4) and $\bar{A}'$ will satisfy the conditions that

\[ \cdots = O(\epsilon \, \cdots) \tag{3.5a} \]

\[ \cdots = O(\epsilon \, \cdots) \tag{3.5b} \]

The conditions proposed in [1] for computing the GSVD of two matrices, $A_1$ and $\operatorname{adj}(A_2)$, follow
from (3.4), (3.5), and the similar construction of the two algorithms.

4.  New  Algorithm 

In this section, we propose a new algorithm for the PSVD problem. Our algorithm is a
modification of the algorithm presented in [2] for a product of several matrices. The tool we use is a
transformation discussed in Charlier et al. [3]:

\[ Q = \begin{pmatrix} s & c \\ -c & s \end{pmatrix} , \]

where $c^2 + s^2 = 1$. We may regard the transformation as a permuted reflection:

\[ Q = \begin{pmatrix} c & s \\ s & -c \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} . \]

The reason behind using permuted reflections is that we actually deal with an $n \times n$ problem. The
permutation that is incorporated into $Q$ corresponds to the so-called odd-even order of eliminations
in one sweep of a Jacobi-SVD procedure.

While each transformation $Q_i$ is defined by the cosine-sine pair:

\[ c_i = \cos\theta_i \quad \text{and} \quad s_i = \sin\theta_i , \]

we also associate $Q_i$ with the tangent

\[ t_i = \tan\theta_i . \]

Given $t_i$, we can easily recover $c_i$ and $s_i$ using the relations

\[ c_i = \frac{1}{\sqrt{1 + t_i^2}} \quad \text{and} \quad s_i = t_i c_i . \]
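
A short Python sketch of these relations follows; it is an illustration rather than the paper's code, and the function names `cos_sin_from_tan` and `permuted_reflection` are assumptions. It recovers the cosine-sine pair from a tangent and forms the transformation as a reflection times the 2 × 2 exchange permutation.

```python
import math

def cos_sin_from_tan(t):
    """Recover (c, s) from t = tan(theta): c = 1/sqrt(1 + t^2), s = t*c."""
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, t * c

def permuted_reflection(t):
    """Reflection [[c, s], [s, -c]] times the exchange permutation [[0, 1], [1, 0]]."""
    c, s = cos_sin_from_tan(t)
    reflection = [[c, s], [s, -c]]
    permutation = [[0.0, 1.0], [1.0, 0.0]]
    return [[sum(reflection[i][k] * permutation[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]

c, s = cos_sin_from_tan(0.75)      # c is about 0.8 and s about 0.6
Q = permuted_reflection(0.75)      # [[s, c], [-c, s]]
```

Working with the tangent alone keeps a single scalar per transformation, and the pair $(c, s)$ is formed only when the transformation has to be applied.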
Following the exposition in [2], we consider the result of applying the left and right transformations
$Q_l$ and $Q_r$ to a 2 × 2 upper triangular matrix $A$:

\[ A' = Q_l A Q_r^T = \begin{pmatrix} a' & b' \\ e' & d' \end{pmatrix} . \tag{4.3} \]

We can derive from (4.3) these four relations:

\[
\begin{aligned}
e' &= c_l c_r ( -a t_r + d t_l - b ) , & (4.4a) \\
b' &= c_l c_r ( -a t_l + d t_r + b t_l t_r ) , & (4.4b) \\
a' &= c_l c_r ( b t_l + d + a t_l t_r ) , & (4.4c) \\
d' &= c_l c_r ( a - b t_r + d t_l t_r ) , & (4.4d)
\end{aligned}
\]

where $t_l = \tan\theta_l$ and $t_r = \tan\theta_r$. The postulates that both $e'$ and $b'$ be zeros define two conditions
on $t_l$ and $t_r$ so that (4.3) represents an SVD of $A$. The postulate that $e'$ be zero defines a condition
relating $t_l$ to $t_r$, so that if one is known the other can be computed in order to reduce $A'$ to an
upper triangular form. For ease of exposition, assume for now that $abd \neq 0$. This condition will be
removed in Section 5.2. It implies that $c_l c_r \neq 0$, and so the postulate that $e' = 0$ in (4.4a) becomes

\[ -a t_r + d t_l - b = 0 . \tag{4.4e} \]

The consequence of (4.4e) is that (4.4c) and (4.4d) simplify to

\[ a' = c_l c_r ( t_l^2 + 1 ) \, d = \frac{c_r}{c_l} \, d \tag{4.4f} \]

and

\[ d' = c_l c_r ( t_r^2 + 1 ) \, a = \frac{c_l}{c_r} \, a , \tag{4.4g} \]

respectively. The relations (4.4f) and (4.4g) imply that

\[ a' d' = a d . \]

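For the SVD problem, both $e'$ and $b'$ are zeros and we can use (4.4e) to reduce (4.4b) either to an equation in $t_l$:

$$ b' = c_l c_r \left( \frac{bd}{a} \right) \left[ t_l^2 + 2\sigma_l t_l - 1 \right] , \tag{4.5a} $$

where

$$ \sigma_l = \frac{d^2 - a^2 - b^2}{2bd} , $$

or to an equation in $t_r$:

$$ b' = c_l c_r \left( \frac{ab}{d} \right) \left[ t_r^2 + 2\sigma_r t_r - 1 \right] , \tag{4.5b} $$

where

$$ \sigma_r = \frac{d^2 - a^2 + b^2}{2ab} . $$

From (4.5a) we get a quadratic equation by setting $b'$ to zero:

$$ t_l^2 + 2\sigma_l t_l - 1 = 0 , \tag{4.5c} $$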
and from (4.5b) we get

$$ t_r^2 + 2\sigma_r t_r - 1 = 0 . \tag{4.5d} $$

The two equations (4.5c) and (4.5d) are solved by the formulas given in [2]:

$$ r = \frac{(d + a)(d - a)}{b} , \tag{4.6a} $$

$$ \sigma_l = \frac{r - b}{2d} , \tag{4.6b} $$

$$ \sigma_r = \frac{r + b}{2a} , \tag{4.6c} $$

$$ t_l = \frac{1}{\sigma_l + \mathrm{sign}(\sigma_l)\sqrt{\sigma_l^2 + 1}} , \tag{4.6d} $$

$$ t_r = \frac{1}{\sigma_r + \mathrm{sign}(\sigma_r)\sqrt{\sigma_r^2 + 1}} . \tag{4.6e} $$

In finite-precision arithmetic, either one of $t_l$ and $t_r$ can be computed with a higher relative precision. In particular, if

$$ \mathrm{sign}(r) = -\mathrm{sign}(b) , $$

then (4.6d) will produce a very accurate $t_l$, whereas if

$$ \mathrm{sign}(r) = \mathrm{sign}(b) , $$

then (4.6e) will produce a very precise $t_r$. If $r = 0$, then both $t_l$ and $t_r$ will be computed with the same relative accuracy.

Now, let $r \neq 0$. We first present a lemma relating the sizes of $\sigma$ and $t$ to those of $a$ and $d$.

Lemma 4.1. Let $abdr \neq 0$. If $|a| > |d|$, then $|\sigma_l| > |\sigma_r|$ and $|t_l| < |t_r|$. Conversely, if $|a| < |d|$, then $|\sigma_l| < |\sigma_r|$ and $|t_l| > |t_r|$.

Proof. See [2]. □
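The formulas (4.6a)-(4.6e) and the relations (4.4a)-(4.4e) can be sketched in Python as follows. This is an illustrative sketch for the regular case $abd \neq 0$, $b \neq 0$, not the authors' implementation: the function names are ours, `sign(0)` is taken as +1 by assumption, and the second tangent is obtained from (4.4e) once the more accurately computable one has been found from its quadratic.

```python
import math

def small_root(sigma):
    """Smaller-magnitude root of t^2 + 2*sigma*t - 1 = 0, as in (4.6d)/(4.6e)."""
    sign = 1.0 if sigma >= 0.0 else -1.0  # assumption: sign(0) is taken as +1
    return 1.0 / (sigma + sign * math.sqrt(sigma * sigma + 1.0))

def svd_tangents(a, b, d):
    """Tangents (t_l, t_r) diagonalizing [[a, b], [0, d]]; needs a*b*d != 0."""
    r = (d + a) * (d - a) / b                  # (4.6a)
    if r * b <= 0.0:                           # sign(r) = -sign(b): use (4.6d)
        t_l = small_root((r - b) / (2.0 * d))  # (4.6b), (4.6d)
        t_r = (d * t_l - b) / a                # (4.4e) solved for t_r
    else:                                      # sign(r) = sign(b): use (4.6e)
        t_r = small_root((r + b) / (2.0 * a))  # (4.6c), (4.6e)
        t_l = (a * t_r + b) / d                # (4.4e) solved for t_l
    return t_l, t_r

def transformed_entries(a, b, d, t_l, t_r):
    """Entries (a', b', e', d') given by relations (4.4a)-(4.4d)."""
    k = 1.0 / math.sqrt((1.0 + t_l * t_l) * (1.0 + t_r * t_r))  # c_l * c_r
    e_new = k * (-a * t_r + d * t_l - b)
    b_new = k * (-a * t_l + d * t_r + b * t_l * t_r)
    a_new = k * (b * t_l + d + a * t_l * t_r)
    d_new = k * (a - b * t_r + d * t_l * t_r)
    return a_new, b_new, e_new, d_new
```

For example, `svd_tangents(3.0, 1.0, 2.0)` has $|a| > |d|$ and returns a pair with $|t_l| < |t_r|$, in line with Lemma 4.1.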

We are ready to present an algorithm for computing the three orthogonal matrices $Q_1$, $Q_2$ and $Q_3$, such that (2.1) and (2.2) are satisfied. The algorithm proceeds in two stages. In the first stage, we calculate the product $A$ explicitly:

$$ a = a_1 a_2 , \tag{4.7a} $$

$$ b = a_1 b_2 + b_1 d_2 , \tag{4.7b} $$

$$ d = d_1 d_2 . \tag{4.7c} $$

We use (4.6a) to calculate $r$, and then compute either $\sigma_l$ or $\sigma_r$ so that the corresponding tangent defines the smaller angular rotation. Hence we obtain either $t_1$ or $t_3$. In the second stage, we use the relation (4.4e) with $t_1$ or $t_3$ as the reference tangent to compute the remaining transformations. Suppose that $t_1$ is known; then $t_2$ and $t_3$ are generated by the forward substitutions:

$$ t_2 = \frac{d_1 t_1 - b_1}{a_1} , \tag{4.8a} $$

$$ t_3 = \frac{d t_1 - b}{a} . \tag{4.8b} $$

On the other hand, if $t_3$ is known, then $t_2$ and $t_1$ are generated by the backward substitutions:

$$ t_2 = \frac{a_2 t_3 + b_2}{d_2} , \tag{4.8c} $$

$$ t_1 = \frac{a t_3 + b}{d} . \tag{4.8d} $$

If $t_1$ is computed first as the reference tangent, then (4.8a) will guarantee that $A_1'$ will be numerically upper triangular and (4.8b) will guarantee that $A'$ will be numerically diagonal. As will be shown later, these two properties will guarantee that $A_2'$ will be numerically upper triangular and hence both (3.4) and (3.5) will be satisfied.
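The two stages can be sketched in Python as follows for the regular case (all of $a$, $b$, $d$, $a_1$, $d_2$ nonzero). This is a hedged illustration rather than the authors' code: the function names are ours, `sign(0)` is taken as +1 by assumption, and the helper `lower_left` simply evaluates the (2,1) element from the relation of the form (4.4a).

```python
import math

def small_root(sigma):
    """Smaller-magnitude root of t^2 + 2*sigma*t - 1 = 0, as in (4.6d)/(4.6e)."""
    sign = 1.0 if sigma >= 0.0 else -1.0  # assumption: sign(0) is taken as +1
    return 1.0 / (sigma + sign * math.sqrt(sigma * sigma + 1.0))

def half_recursive_tangents(a1, b1, d1, a2, b2, d2):
    """Tangents (t1, t2, t3) for the product of two upper triangular 2x2 factors."""
    # Stage 1: form the product explicitly, (4.7a)-(4.7c).
    a = a1 * a2
    b = a1 * b2 + b1 * d2
    d = d1 * d2
    r = (d + a) * (d - a) / b                 # (4.6a)
    # Stage 2: pick the reference tangent, then substitute.
    if r * b <= 0.0:                          # t1 is the reference tangent
        t1 = small_root((r - b) / (2.0 * d))  # (4.6b), (4.6d)
        t2 = (d1 * t1 - b1) / a1              # (4.8a)
        t3 = (d * t1 - b) / a                 # (4.8b)
    else:                                     # t3 is the reference tangent
        t3 = small_root((r + b) / (2.0 * a))  # (4.6c), (4.6e)
        t2 = (a2 * t3 + b2) / d2              # (4.8c)
        t1 = (a * t3 + b) / d                 # (4.8d)
    return t1, t2, t3

def lower_left(a, b, d, t_left, t_right):
    """(2,1) element c_l*c_r*(-a*t_r + d*t_l - b) of the transformed matrix."""
    k = 1.0 / math.sqrt((1.0 + t_left ** 2) * (1.0 + t_right ** 2))
    return k * (-a * t_right + d * t_left - b)
```

With $A_1 = \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}$ and $A_2 = \begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix}$ the product has $rb < 0$, so $t_1$ is the reference tangent; the (2,1) elements of $A_1'$, $A_2'$ and $A'$ then all vanish to rounding error.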

We refer to the method defined by (4.8a)-(4.8b) or (4.8c)-(4.8d) as half-recursive, to differentiate it from the fully-recursive method proposed in [2] for computing the PSVD of several matrices. The fully-recursive method also picks the smaller outer angular rotation as the starting point for the recursion, from which all remaining rotations are computed. However, in [2], the other outer rotation is computed from the previous rotation in the sequence. For example, in the case of a product of two matrices, the tangent $t_3$ in (4.8b) would be computed from $t_2$ using (4.4e):

$$ t_3 = \frac{d_2 t_2 - b_2}{a_2} . \tag{4.9} $$

Note how (4.8b) uses the product $A$ whereas (4.9) uses the matrix $A_2$. It was shown in [1] that the fully-recursive method may fail to satisfy (3.5) and thus is not recommended for the GSVD problem. On the other hand, the fully-recursive method easily extends to any number of factors in the product. It is not clear what an appropriate extension of the half-recursive method would be for the case of a product of more than two matrices.

Our half-recursive method is equivalent to the method proposed by Bai and Demmel in [1] in the sense that it also computes a very accurate PSVD of $A_1 A_2$, and that it uses essentially the same criterion in choosing whether to compute the middle transformation $Q_2$ from $Q_1$ or from $Q_3$. A proof that the two methods use the same condition for computing $Q_2$ is given in Appendix B.

5.  Backward  Error  Analysis 

In this section, we present a backward error analysis of our computation. The function $\mathrm{fl}(a)$ will be used to denote the floating point approximation of $a$. For the purpose of this analysis, a "bar" denotes a computed quantity which is perturbed as the result of inexact arithmetic. For example, instead of $a$, $b$ and $d$, we have the perturbed values $\bar a$, $\bar b$ and $\bar d$ which result from floating point computation of $A_1 A_2$. We assume that exact arithmetic may be performed using these perturbed values. The "tilde" symbol is used to denote conceptual values computed exactly from perturbed data. For example, $\tilde r$ denotes the result of using formula (4.6a) in exact arithmetic with the perturbed data $\bar a$, $\bar b$ and $\bar d$.

In our error analysis, we adopt a convention that involves a liberal use of Greek letters. For example, by $\alpha$ we mean a relative perturbation of an absolute magnitude not greater than $\epsilon$, where $\epsilon$ denotes the machine precision. All terms of order $\epsilon^2$ or higher will be ignored in this first-order analysis.
We start our procedure by computing elements of the product matrix $A$:

$$ \bar a := \mathrm{fl}(a_1 a_2) = a_1 a_2 (1 + \alpha) , \tag{5.1a} $$

$$ \bar d := \mathrm{fl}(d_1 d_2) = d_1 d_2 (1 + \delta) , \tag{5.1b} $$

$$ \bar b := \mathrm{fl}(a_1 b_2 + b_1 d_2) = a_1 b_2 (1 + 2\beta_1) + b_1 d_2 (1 + 2\beta_2) , \tag{5.1c} $$

where, according to our convention, the parameters $\alpha$, $\delta$, $\beta_1$ and $\beta_2$ are all quantities whose absolute values are bounded by $\epsilon$. From (5.1) it follows that

$$ \bar A = (A_1 + \delta A_1)(A_2 + \delta A_2) , $$

with $\| \delta A_i \| \le \epsilon \| A_i \|$. This property, which in general does not hold for a product of more than two 2 x 2 upper triangular matrices, will allow us to prove backward error type assertions on the half-recursive method.

Our analysis is divided into two parts. In Section 5.1, we consider a regular case where all elements of the computed matrix product are numerically significant with respect to the maximal-in-magnitude element; i.e.,

$$ \min( |\bar a| , |\bar b| , |\bar d| ) > \epsilon \, \max( |\bar a| , |\bar b| , |\bar d| ) . \tag{5.2} $$

In Section 5.2, we consider special cases where at least one element of the computed $\bar A$ is numerically insignificant.
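The split between the regular and the special cases can be illustrated with a short Python check of condition (5.2) on the computed product entries. This is only a sketch: the function names are ours, and taking `sys.float_info.epsilon` as the machine precision $\epsilon$ is an assumption (the unit roundoff, half that value, would be an equally reasonable choice).

```python
import sys

def product_entries(a1, b1, d1, a2, b2, d2):
    """Entries (a, b, d) of the upper triangular product, as in (4.7a)-(4.7c)."""
    return a1 * a2, a1 * b2 + b1 * d2, d1 * d2

def is_regular_case(a, b, d, eps=sys.float_info.epsilon):
    """True when every entry is numerically significant in the sense of (5.2)."""
    mags = (abs(a), abs(b), abs(d))
    return min(mags) > eps * max(mags)
```

A driver routine would call `is_regular_case` on the computed product and fall back to the special-case handling of Section 5.2 when it returns `False`.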


5.1.  Regular  Case 

Without loss of generality, we assume that $rb < 0$; i.e., $\mathrm{sign}(r) = -\mathrm{sign}(b)$. Thus we compute $t_1$ first as the reference tangent from which $t_2$ and $t_3$ will be next determined via (4.8a) and (4.8b), respectively. We recall several lemmas from [2].

Lemma 5.1. Let $t_1$ and $\bar t_1$ be the exact and computed solutions, respectively, of equation (4.6d) with data $\bar a$, $\bar b$, $\bar d$. Moreover, let $c_1$, $s_1$ and $\bar c_1$, $\bar s_1$ be the exact and computed cosines and sines using (4.2) with the tangent value $\bar t_1$. Then

$$ \bar t_1 = t_1 (1 + 10\epsilon_1) , \tag{5.3a} $$

$$ \bar c_1 = c_1 (1 + 3\mu_1) , \tag{5.3b} $$

$$ \bar s_1 = s_1 (1 + 4\nu_1) , \tag{5.3c} $$

where $|\epsilon_1| \le \epsilon$, $|\mu_1| \le \epsilon$, and $|\nu_1| \le \epsilon$.

Proof. See [2]. □

In other words, Lemma 5.1 states that the procedure (4.6a)-(4.6e) for solving (4.5c) is numerically stable in the forward sense. Two lemmas follow, leading to our main result of Theorem 5.1.


Lemma 5.2. The recurrences (4.8a) and (4.8b) yield $\bar t_2$ and $\bar t_3$ such that

$$ \hat a_1 \bar t_2 - \hat d_1 \bar t_1 + b_1 = 0 , \tag{5.4a} $$

$$ \hat a \bar t_3 - \hat d \bar t_1 + \bar b = 0 , \tag{5.4b} $$

where

$$ \hat a_1 = a_1 (1 + 2\nu_1) , \quad \hat d_1 = d_1 (1 + \phi_1) , \tag{5.4c} $$

$$ \hat a = \bar a (1 + 2\nu) , \quad \hat d = \bar d (1 + \phi) . \tag{5.4d} $$

Proof. The proof easily follows from (4.8a) and (4.8b). □

Lemma  5.3.  The  recurrence  (4.8b)  yields  t3  such  that  /3  =  i 1  -+  1 A i  . 

Proof.  From  (4.8b) 

( dh(i +  Uv)-b\  n  4  /d/,-6.  1 1  id/,  \  (.  -d/,\ 

_  f - - - j  ( 1  +  i )  =  f  — — —  4 - - —  j  ( 1  +  2~;\  j  =  f  t i  +  1 1  i  ljr  j~  j  ( 1  4  |  . 

Since  jd/«j  <  1  and  j / ,  / f 3 1  <  1.  we  got  (3  =  f3(  1  +  1.')'  ),  Q 

We now show that $\bar a'$ and $\bar d'$ are computed with high relative precision.

Theorem 5.1. Let $a'$ and $d'$ be the exact singular values of the computed product $\bar A$. If $\bar a'$ and $\bar d'$ are computed via relations (4.4c) and (4.4d), then the computed singular values $\bar a'$ and $\bar d'$ satisfy the following relations:

It  —  tl'f  1  +  O.,  )  .  <!'  —  d'(  1  -r  (\,  j  .  I  3.5 ) 


Proof. From (4.4f) and (4.4g), we get

$$ a' = \bar d \,( t_1^2 + 1 )\, c_1 c_3 \quad\text{and}\quad d' = \bar a \,( t_3^2 + 1 )\, c_1 c_3 , $$

where $t_1$ and $t_3$ are the exact tangents corresponding to the data $\bar a$, $\bar b$ and $\bar d$, and $t_i = s_i / c_i$. Thus, the theorem follows from Lemmas 5.1 and 5.3. □


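Theorem 5.2. Suppose that the computed tangent values are $\bar t_1$ and $\bar t_3$. Let $c_1$, $s_1$, $c_3$ and $s_3$ be the corresponding exact cosine and sine values. Let

$$ \tilde e' := c_1 c_3 \,[ -\bar a \bar t_3 + \bar d \bar t_1 - \bar b ] , \tag{5.6} $$

$$ \tilde b' := c_1 c_3 \,[ -\bar a \bar t_1 + \bar d \bar t_3 + \bar b \bar t_1 \bar t_3 ] . \tag{5.7} $$

That is, $\tilde e'$ and $\tilde b'$ are the exact values of $e'$ and $b'$, respectively, corresponding to the computed data $\bar a$, $\bar b$, $\bar d$, $\bar t_1$ and $\bar t_3$. Then

$$ | \tilde e' | \le K_1 \epsilon \, \| A \| , \tag{5.8} $$

$$ | \tilde b' | \le K_2 \epsilon \, \| A \| , \tag{5.9} $$

where $K_1$ and $K_2$ are some positive constants.

Proof. See Appendix A. □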

Theorems 5.1 and 5.2 together state that the SVD of the upper triangular matrix $\bar A$ is computed very accurately. We now justify why the (2,1) element in the computed matrix $A_i'$ can be set to zero by showing that $|e_i'|$ corresponds to a relative and elementwise perturbation of $A_i$ of the order of $\epsilon$. Let the cosine and sine pairs $c_i$ and $s_i$ satisfy $\bar t_i = s_i / c_i$, for $i = 1, 2, 3$. From (4.2) we can derive that

$$ \bar c_i := \mathrm{fl}(c_i) = c_i (1 + 3\mu_i) , \tag{5.10a} $$

$$ \bar s_i := \mathrm{fl}(s_i) = s_i (1 + 4\nu_i) . \tag{5.10b} $$

Let $A_i'$ denote the exact updated matrix derived from $A_i$, $c_i$, $c_{i+1}$, $s_i$ and $s_{i+1}$. Our next results provide a bound on the element $e_i'$, $i = 1, 2$, defined by the relation

$$ e_i' := -c_i s_{i+1} a_i + s_i c_{i+1} d_i - c_i c_{i+1} b_i . \tag{5.11} $$

Theorem 5.3. The matrices $A_1'$ and $A_2'$ are almost upper triangular in that their (2,1) elements $e_1'$ and $e_2'$ satisfy the inequalities:


<  3  f 


-1, 


and 

1  U  '  <  A'  s  '  i!  -4 j 


( 5. 1 2a  j 

(5.  \2h  i 


Proof. Note that $A_1'$ is the same for both fully-recursive and half-recursive methods. The proof that $A_1'$ is almost upper triangular in the sense that (5.12a) holds can be found in [2].

To prove the second part of the theorem, from (5.4a)-(5.4d) and (5.1a)-(5.1c), we get the following two relations to first order of the machine precision:


ft  i  ( 1  -+-  2  c  i  )tj  -  c/j  ( 1  +  O]  Ki  4-  />)  =  ft  . 


(5.1da) 


Q\ii2(  1  +  g  +  'Ji'jh  —  r/t r/_> ( 1  +  t1  +  o)h  4-  « 1 6 j f  1  4-  2J|  j  4-  /;|d>(  1  4-  2J})  =  0  .  (5. 13b) 

By  multiplying  both  sides  of  (5.13a)  by  t/2(  1  4-  2 J2)  and  subtracting  from  (5.13b)  we  obtain 

\  {ttz(  1  4-  o  4-  2c )H  —  ^  (<5  4-  o  —  0|  +  2.ii)t\  4-  6j(  1  4-  2.3| )  —  d>{  1  4-  2.32  4-  2vi  )<2 }  =  0  . 

or.  since  Q!  ^  0. 


«2(  1  4  o  4  2c)  <3  — 


d\<tj 


•+•  o  —  Oj  +  2 ■  3 2  )h  4-  1  4-  2.3 1 )  —  c/2 { 1  4-  2J2  4-  2 c i  )t2 

=  02/3  —  d 2 (2  4*  l>2  4-  A  =  0  , 


where 


Thus,  we  can  rewrite  (5.11)  for  i  =  2  as 


A  —  ct 2 ( c*  4"  2c)h  —  (  —  J  « 2 ( 2*  +  O  —  <2»i  +  2,3-2 )(j  4-  62, >3 1  —  <M2.32  4-  2tq  )<2  . 


e2  —  — C2  s3a2  +  ^2^3 <3 2  ~  C2C362  4-  CjC2(a2h  ~  d2h  +  62  +  A  )  .  1 5.13c) 

Now, as we start the half-recursive method from $\bar t_1$, it means that $|t_1| < |t_3|$ and $|d| < |a|$. Hence from (5.10a), (5.10b) and (5.13c), we derive the inequality:

1^2 1  £  |-53C2a2( Ck  4-  2c)|  4-  \c3C2a2(6  4-  0  -  <Pi  4-  2^)|  4-  4-  |c3S2d2(2^2  +  2c’i)| 
$\le K_3 \epsilon \, \| A_2 \|$ ,

completing the proof. □

In summary, we have proved two results using backward error analysis. First, the transformed matrix $A'$ is almost diagonal in that inequalities (5.8) and (5.9) both hold. Second, we can safely set each computed matrix $A_i'$, $i = 1, 2$, to a triangular form because (5.12a) and (5.12b) are valid. As a final note, even though we have assumed that $rb < 0$, we can easily prove similar results for the case where $rb > 0$.

5.2.  Special  Cases 

In this subsection, we assume that inequality (5.2) is violated. To be specific, define

$$ \gamma := \min( |\bar a| , |\bar b| , |\bar d| ) \tag{5.14} $$

and

$$ \Gamma := \max( |\bar a| , |\bar b| , |\bar d| ) . \tag{5.15} $$

Now,

$$ \gamma < \epsilon \Gamma ; \tag{5.16} $$

i.e., one of the elements of $\bar A$ is numerically insignificant. This situation requires modifications to our algorithm, since the proposed formulas may break down. In particular, we do not solve a quadratic equation to determine either $t_1$ or $t_3$. Instead, we set one of the two tangents to zero and attempt to compute all the other tangents from the recurrences. We divide the special cases into three groups: first,

$$ |\bar a| + |\bar d| \neq 0 \quad\text{and}\quad |\bar b| \neq 0 , \tag{5.17} $$

second,

$$ |\bar a| + |\bar d| = 0 \quad\text{and}\quad |\bar b| \neq 0 , \tag{5.18} $$

and third,

$$ |\bar b| = 0 . \tag{5.19} $$

First, assume that (5.17) holds. Hence at least one, but not all, of the following three conditions hold:

$$ \gamma = |\bar b| , \quad \gamma = |\bar a| \quad\text{or}\quad \gamma = |\bar d| . $$

We set $t_1$ to zero if

$$ |\bar a| > |\bar d| , \tag{5.20} $$

and set $t_3$ to zero if

$$ |\bar a| < |\bar d| . \tag{5.21} $$

Thus, the sizes of the diagonal elements of $\bar A$ will be compared to determine which one of $t_1$ or $t_3$ should be zeroed. Without loss of generality, assume that (5.20) holds; hence, $t_1$ becomes the reference angle. So, $t_2$ and $t_3$ are computed from recurrences (4.8a) and (4.8b). Further, since $t_1 = 0$ it follows that $t_3 = -\bar b / \bar a$. Substituting these values into (5.6) and (5.7), we can verify that Theorem 5.2 holds. Similarly, Theorem 5.3 follows from (5.11). We note that it is very important to decide which reference angle to choose, even for the case when $\bar b$ is numerically zero. At first, the choice of the reference angle may seem arbitrary for a "small" $\bar b$, since either $t_1$ or $t_3$ can be set to zero. However, an unnecessarily large error may occur unless we take special care.
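The first special case, (5.17), can be sketched in Python as follows. The function name is ours, and resolving the tie $|a| = |d|$ in favor of $t_1 = 0$ is an assumption, since the text only specifies the strict inequalities (5.20) and (5.21).

```python
def special_case_tangents(a1, b1, d1, a2, b2, d2):
    """Tangents (t1, t2, t3) when (5.17) holds: no quadratic is solved."""
    a = a1 * a2                     # (4.7a)
    b = a1 * b2 + b1 * d2           # (4.7b)
    d = d1 * d2                     # (4.7c)
    if abs(a) >= abs(d):            # (5.20); the tie is our assumption
        t1 = 0.0                    # reference tangent set to zero
        t2 = (d1 * t1 - b1) / a1    # (4.8a)
        t3 = (d * t1 - b) / a       # (4.8b), i.e. t3 = -b/a
    else:                           # (5.21)
        t3 = 0.0                    # reference tangent set to zero
        t2 = (a2 * t3 + b2) / d2    # (4.8c)
        t1 = (a * t3 + b) / d       # (4.8d), i.e. t1 = b/d
    return t1, t2, t3
```

The divisions are safe under (5.17): in the first branch $|a| \ge |d|$ together with $|a| + |d| \neq 0$ forces $a_1 a_2 \neq 0$, and in the second branch $|d| > |a| \ge 0$ forces $d_1 d_2 \neq 0$.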
Second, assume that (5.18) holds. Then, at least one of the $a_i$'s equals zero and at least one of the $d_j$'s also equals zero, for $i, j = 1, 2$. A solution is to permute either the rows or the columns in order to ensure that the transformed product is diagonal and that the data are reordered. Hence for this case, we may set the two extreme tangents to $\{0, \infty\}$, resulting in the transformations

being rotations of negative ninety and zero degrees, respectively. To be specific, consider the case where one or more $a_i$'s equal zero. If $a_1 = 0$, set $t_1 = 0$ and $t_2 = t_3 = \infty$. If $a_1 \neq 0$ and $a_2 = 0$, set $t_1 = 0$, compute $t_2$ from the forward recurrence and set $t_3 = \infty$. Note that we may also choose to determine the tangents using the values of the $d_j$'s.

Third, assume that (5.19) holds. We need to account for the fact that we are really solving an n x n problem. Although the 2 x 2 subproblem is already numerically diagonal, it is not sufficient to set $t_1 = t_3 = \infty$, which will leave the 2 x 2 product unchanged. The n x n data need to be reordered, calling for $t_1 = t_3 = 0$; i.e., the affected rows and columns will be permuted. Unfortunately, while applying the symmetric permutation, the triangular structures of both $A_1$ and $A_2$ are destroyed. Therefore, $t_2$ is determined from the recurrence.


6.  Concluding  Remark 

In this paper we have presented a simple and accurate way to calculate the PSVD or GSVD of two 2 x 2 upper triangular matrices. In Appendix C we present an example which shows that our half-recursive method produces the same numerical results as the method in [1]. A significant issue in the design of PSVD algorithms is how to compute the middle transformation. The method used in our half-recursive algorithm is computationally more efficient than the method in [1] and yields identical results. The following table lists the number of floating point operations to compute the three transformations, $Q_1$, $Q_2$, and $Q_3$, for the three different algorithms in the regular case. The column labeled "Simplified Direct" lists the operation count for the Bai and Demmel algorithm if our simplified method is substituted for their method of computing the middle transformation.

Floating-point Operation Counts

                  Direct   Simplified Direct   Half-recursive
Addition            29            23                 26
Multiplication      57            45                 41
Division            13            11                  8
Square Root          4             4                  4

The Half-recursive method is less expensive than the Direct method and similar in cost to the Simplified Direct algorithm. In addition, the upper-triangular structure of the 2 x 2 matrices is maintained by the Half-recursive method. Application of the 2 x 2 Half-recursive algorithm to n x n problems is a topic for further investigation.
Appendices


A  Proof  of  Theorem  5.2 
We  first  present  a  lemma. 

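Lemma A.1. Let $\sigma_1$ and $t_1$ be the exact values corresponding to the given data $\bar a$, $\bar b$ and $\bar d$, and let $\bar t_1$ be the computed value of $t_1$. Define a residual $r_1$ by

$$ r_1 := \frac{\bar b \bar d}{\bar a} \left( \bar t_1^2 + 2\sigma_1 \bar t_1 - 1 \right) . \tag{A.1} $$

Then

$$ | r_1 | \le K_4 \epsilon \, |\bar b| , \tag{A.2} $$

where $K_4$ is a positive constant.

Proof. See the proof of Lemma 5.2 in [2]. □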

We  now  have  the  necessary  tools  for  proving  the  theorem. 

Proof  {of  Theorem  5.2).  First,  from  Lemma  5.2  and  relation  (5.4b)  we  get 

e  =  c i c 3 [ (  —  at  3  +  <lt i  -  b)  +  ( <ii:i  -  dty  +  bj j  =  (a  -  d)cyi3  -  (d  -  d)^yc3  . 

Using  (5.1a)-(5.1b)  and  ( 5 . 4 d )  we  prove  the  inequality: 

I  f'  i  <  Kt  (|  «  |  +  |  d\  )  <  A'j<  |1  A  ||  .  (A. 3) 


Second,  rewrite  (A.l)  as 

rj  =  r[dbfi  +  i\{d2  -  a2  -  b2}  -  d&]  =  -{{city  -  b){bty  +  d)  -  Del2]  . 

a  a 

From  (5.G)  we  obtain 

1  ,  -  V 

~(dt i  -  b)  ~  1 3  +  r  —  r  ■ 

u  Cj  Cju 

Substituting  (A. 5)  into  (A. 4)  and  rearranging  terms,  we  get 

.—  j  -  -  (' ( biy  +  d  j 

—  (it  y  “t"  dt3  +  btyf  3  —  ?’]  —  — - — -  , 

CyCjyU 


and  so 


From  (4.Gd)  we  derive 


and  from  (4.0b )  we  get 


1  1 


b 

2d 


(A. 4) 

(A.  5 ) 

( A  .6 ) 
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It  follows  that 


-nut'  we  have  a.-ouined  that  i  </  i  <i  a  !  .  Fin  all  v.  recall  from  (Ad)  that  fj  =  tjil  -  10*  rj.  and  u-e 
(A.U),  Lemma  A.l  and  (A.j)  a*  obtain 

1  b'  I  <  c i v y  !  r,  1  4-  2  1  f '  |  <  l\,i  jl  ,1  ||  .  !  A .>  i 


thus  completing  the  proof. 


□ 


B  How  to  Compute  the  Middle  Transformation 


As pointed out by Bai and Demmel in [1], a critical issue concerns how the middle transformation should be computed. They proposed the following scheme for its computation after both end transformations have been determined. In order to relate the test for computing $Q_2$ in [1] to the test in the half-recursive method, we first translate our setting to that in [1]. Let


rT  ~  ( f>  -s‘ 

V  ■>  i  C I 

Note  that  the  relation,  given  bv 

CM.  = 


Q 


C>  -Si 

I  Cl 


and 


■M  fi 
-c,  s, 


a  i  b\ 

0  </, 


e  i 


Si«i  Ml  -r  c | d i 
c  i  u  i  —  r  1  /j  1  +  -s  1  </ 1 


upon  permuting  rows  and  changing  the  signs  of  the  top  row.  is  equivalent  to 


Similarlv. 


rT.  t, 


c,  -s, 
>'l  t'l 


«•»  b ) 


"1  bx 
0  dx 


<VM  t?  1 6  [  —  t 


*i«i  s\b |  +  <  i rf\ 


=  G 


M  -t’3 
C't 


■S3«.|  +  C3b2  —  C3U2  +  ■S3b2 

cyh  »3d2 


0  (1 2 

By  changing  the  sign  of  the  second  columns  and  permuting  columns,  we  obtain 

C3  --**3  ^  (  d2  -b2 


V1  adj( A 2)  - 


•S3  C3 


0  «2 


c3(i2  -c3b2  -  s3a2  }  =  jj 

S3(l2  —s3b2  +  C3C12 


(B.lai 


(B.lb) 


( B.'Ja ) 


(B.2b) 


In [1], Bai and Demmel used (B.1b) and (B.2b) as a starting point for computing $Q_2$. Their argument is as follows. After postmultiplications of both (B.1b) and (B.2b) by $Q_2$, the (1,2) elements of $G$ and $H$ should become zeros. Now, one should compute $Q_2$ from the one product, either $G$ or $H$, for which the computed element in the (1,2) position has a smaller error relative to the norm of the row in which it resides. The magnitude of that error can only be bounded, and hence the test for the choice is based on the bounds of the errors. It is easy to see that the bound $g$ for the relative error in the (1,2) element of the computed $G$ is


-  K'i^i  I  +  IMil 
k.fl. I  +  lci&i  -  Mil 


( B.3a) 
while the bound $h$ for the relative error in the (1,2) element of the computed $H$ is


Ickrl  4-  | 

Ickil  +  |f3^2  +  •'•3  <IJ 


IB.  3b) 


Now if $g < h$, then Bai and Demmel compute $Q_2$ from $U^T A$ and otherwise from $V^T B$. The next lemma shows that the conditions specifying how $Q_2$ is computed by Bai and Demmel and by the half-recursive method are essentially equivalent.


Lemma B.1. In exact arithmetic, the condition

$$ g < h , \tag{B.4a} $$

where $g$ is defined by (B.3a) and $h$ is defined by (B.3b), is equivalent to the condition

$$ |a| > |d| . \tag{B.4b} $$


Proof.  First  note  that  (B.3a)  and  (B.3b)  can  be  simplified  to 

kii  +  IMil 


<J  -  - 


and 


I,  =  - 


juj ;  +  |t|  r/i  -  b\ ( 
+  ihiuj! 


\<IA  +  lh“:  +  hi 

respectively.  Through  (4.8a  >  and  (4.8c)  the  relations  (B.3n)  and  (B.3b)  simplify  furlher  to 


( B  oa ) 


{ B.5b) 


and 


respectively. 


ki!  4- 1 / 1  o')  1 
i«il(  1  4-  ioi) 

( B.Ga ) 

f  ...  1^!  +  Usfl.’l 
krill  +  lol)' 

(B.Gb) 

Hence  (B.4a)  is  equivalent 

to 

IMr! 

~b  |/| d\  ^  !«]/;/  *t*  \at:s\  . 

(B.7) 

We now prove that (B.4b) implies (B.4a). The proof that $|a| < |d|$ implies that $g > h$ is analogous and is omitted. Our proof is elementary but tedious as it requires us to consider a large number of cases. Assume that $|a| > |d|$. Then Lemma 4.1 implies that $|t_3| > |t_1|$. From (4.8b) we see that


jfi/j  -+-  hi  — 


and  as  jo/31  >  |«/ji  we  conclude  that 

sign! at :i )  =  -siun(/i)=  — si«n( «i  b,  b\ <l2  1  .  (B>) 

as  from  (4.7b)  6  =  atb2  +  bx <l2.  Substituting  (4.8b)  into  ( B.7)  and  using  (4.7b)  again  we  get  that 
(B.7)  is  equivalent  to  the  following  inequality: 

| b^l :\  4-  (nf;j  4-  (i)h2  4-  <  ! « 1  /> > j  4-  |«(;d  ■  (B.9) 


Case  1.  -|bj  >  [ b\flt\  -  |«i bA- 
Then

Ififil  \dti  \  —  |5|  >  |<^i|  -t-  !M.>|  —  j«i52j  • 

establishing  ( B.7). 

Case  '2a.  — 16|  >  |6]d>j  -  and  |«f3j  >  |i|. 

Then  ja 1 6^ j  >  |/q d>\  and  using  (B.8)  we  obtain  that 

[Mil  -r  jd/i  |  —  \bid>\  +  jo f3  +  a i b>  +  hirf2l  —  j ei f 3 (  4-  2  |hj d2\  —  jo  j b2\  . 
from  which  (B.7)  follows. 

Case  2b.  — 16|  >  1 6 1  c/  >  |  —  j  a  1 63 1  and  |nf3j  <  |6[. 

Then  again  |a|62|  >  |6i</2j.  Now  from  (B.8) 

|£»id2|  +  |d?jj  =  !  6 1  r/  _» !  +  |n?3  +  <i\b2  +  b\d2\ 

=  !M.'i  -  i at-i\  +  \n\b2\  -  \bxd2\  =  jo,62|  -  ja#3t  . 
from  which  (B.7)  again  follows. 


Remark. Note that there might be a slight difference in using (B.4a) or (B.4b), as the lemma holds only in exact arithmetic. In finite precision computation, the relations (B.4a) and (B.4b) may not always be equivalent. However, we have not been able to find any numerical example where these two conditions are not equivalent. Moreover, as shown in this paper, the consequences of numerical non-equivalence are numerically insignificant.


C  Numerical  Example 


It has been proved in Appendix B that the half-recursive procedure computes essentially the same numerical results as the direct method of [1]. For both methods, the end transformations are computed explicitly from the product $A = A_1 A_2$, and the middle transformation is computed from the same direction. The greatest difference between the fully-recursive method and the other two occurs when there is cancellation in forming the product $A = A_1 A_2$. In the following PSVD example, $A_1$ and $A_2$ each has an $O(1)$ norm, but the product $A_1 A_2$ has an $O(10^{-5})$ norm. Hence errors which are small relative to the initial matrices may be large relative to the product.

$$ A_1 = \begin{pmatrix} 2.316797292247488\mathrm{e}{+00} & -1.437687878748196\mathrm{e}{-01} \\ 0 & -5.208536329107726\mathrm{e}{-06} \end{pmatrix} $$

$$ A_2 = \begin{pmatrix} 2.472499811756353\mathrm{e}{-05} & 2.624474233535929\mathrm{e}{-01} \\ 0 & 4.229273187671001\mathrm{e}{+00} \end{pmatrix} $$

$$ A_1 A_2 = \begin{pmatrix} 5.728280608959543\mathrm{e}{-05} & -1.110223024625157\mathrm{e}{-16} \\ 0 & -2.202832304370565\mathrm{e}{-05} \end{pmatrix} $$

The three methods all compute the left transformation from the explicit product and calculate the middle transformation from $A_1$. We use the subscripts dir, hr, and fr to distinguish between results computed via the direct, half-recursive, and fully-recursive methods, respectively. The computed values of $A'_{1,dir}$, $A'_{1,hr}$, and $A'_{1,fr}$ are numerically identical in that the corresponding entries are numerically equal:
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/  2  32i2>37i*UUV;7Vf<  -  uu  V  77  V 
l  3  1 2  7  :♦  3  u  U  7  o  *  *_»  2V-  *  7  *  —  ij  7  -  I  v " 


!  V*>  ;*  3  o  fj  3  3  *>  V  7?  >  <  -  *•  *  - 
7  <  o  }  'j  7  ■)  h  i  >'.  2  >.fir  --  4 « 


•,  /  -5  it«;v  -  ob  -3  yj 7  v;i\it,>?b»v2c' 7 *  -  >  "  f 

1  ' '  l  —2  775$575615o2*ylc  —  17  2'  3. : .'  >3 .  ■•  (  i .  7>'  <  •»  uu 

The computed matrices $A'_{2,dir}$ and $A'_{2,hr}$ are numerically triangular, but now the (2,1) element of $A'_{2,fr}$ is significantly different from the corresponding elements in $A'_{2,dir}$ and $A'_{2,hr}$:

I  2  ■it377o2941777u2bf  —  U.3  a  7* r»  1 1  S7H'.‘, !237v><-  -  IT  \ 

’  -  M'  {  1  !>3J3$37'<M7u77<jv  -  Oti  4  '2374i.>54*;yt3:OW  ♦  ov  ■ 


4  2->7  ;0T4  it'1.' ! 3>‘,  ,,  h-  i,‘y  -  i  73i3337-!7(  77'  'r  •  ur 


i  i  i<  yi-iyj.K  -*  ev  -  j  .r.Ji. >.■>,> i -  i 

’*  -  '1'  ~  \.  —3  331  i  1312.U237r>.V  -  17  2  4»,7732P4  j : 


_  ,  4  237  i  ..s  ! j  •-  -  ... ;  7.v;.w.s  •  P  -  ■  7 

•  •’  v  0  2  •|n77'»2'.»  \ ;  7  ,•  « e  —  v  > 

To maintain triangularity, $A_1'$ and $A_2'$ are truncated by setting the appropriate elements to zero. Let $A_1''$ and $A_2''$ denote the truncated matrices. The product $A'' = A_1'' A_2''$ should be diagonal:


j  -3  7  2  ^  2  s  r J * f  '<  >  U  7.  *  *>  \  2<  —  0  * 
i  t)  1  •>.'>.*  7  l-WX'l, 


t.,;  ('ill  ~  1 '  I 


4",,.  = 


A'Ar  - 


-2  2U2>323i>  r'i7,j',,3!.  -  if,  -  !  PI  73'7  i  :i3';i2i  32*  -  27  ^ 
{)  7  7_  -  v  ~  * “  ■  '  a  7  a  74 2 '  ~  -  7  / 

-2  2|.)2>.V2d!M37i;7n4'  -  03  .7  •  h-3  52'  4  7<  i-  -  17 

(i  3  72'.'  v  '!*7':'742'  -•  i.'3  j 


Clearly.  4"(,t  and  ,4",lt,  are  numeticalK  diagonal.  but  ,4"  !f  f  .»sl-  the  ns  tenon  of  diagonald;.  I-orting 
.4 "  j  r  to  be  a  diagonal  matrix  requires  a  truncation  of  (/•  jtr  1  .  vs  inch  o  significant  wit  it  respect 
to  j:.4"jj.  The  matrices  A".j,r  and  .4",..  recjuire  onlv  inMgniftcant  truncations  to  obtain  diagonalt?>. 
hut  we  have  previously  made  ()i  10"1  i  ti  mtcatioiis  duting  thejr  computation  to  force  4 and 
$A_2'$ to triangular forms. Thus, equal amounts of absolute truncation error have been committed by all three methods. The only difference is that the relative truncation error is largest for the fully-recursive method.

It is interesting to note that if triangularity is not enforced and the factors $A_1'$ and $A_2'$ are multiplied, then none of the products can be considered diagonal. One may say that the numerical diagonality of $A''_{dir}$ and $A''_{hr}$ is a consequence of the truncation to triangular forms.


•n  -47  f„ 


i  i  72s2'iJ'1..'.'  >.<  i  i-U  —  -  pi  pi7  i -•  7  (7  1 .7  ;.)r  -  ji. 

\  1  fil3.7'7i.’.:t'li2'.:),2.  -  27  -2  2  2v'S2';'M?.7o7e$.  -  n*. 


.,  _  (  -2  202>:!2n'i.i37e3b1c  -  03  - 1  *7133'7n4'42^:i2.  -  27  \ 

A'>  hr  »r  ~  {  -2  4<.1f.7iN>l7d71VMc  -  l<i  5  T2a2^iiM>:.r,f'.*,42'  -  (•*»  j 

i  -2  202v'i23!H:i7u7bl.  -  P7  7  Ol'U  I2'fil3f.2:‘n!.  -  17  , 

■'  '  !r  ’■  -  ■'  ~  \  -I  1  To  1 17103b2n27k  -  K>  3  72'2't<'3':'3'i3  y>'  -  03  I 

In conclusion, our example shows that the half-recursive and direct methods produce numerically
identical results, while the fully-recursive method fails to meet the diagonality criterion.
ABSTRACT

The problem of linearly constrained least squares has many applications in signal processing. In this paper,
we present a perturbation analysis of a linearly constrained least squares algorithm for adaptive beamforming. The
perturbation bounds for the solution as well as for the latest residual element are derived. We also propose an error
estimation scheme for the residual element, which can be incorporated into a systolic array implementation of the
algorithm.


1.  INTRODUCTION 

The least squares problem with linear equality constraints has important applications in signal processing,
e.g., adaptive beamforming. To solve this problem, McWhirter and Shepherd [5] proposed a systolic algorithm and
architecture. In this paper, we present a perturbation analysis of the problem and propose an error estimation
scheme for the McWhirter-Shepherd (MS) algorithm [5]. This paper is organized as follows. The least squares
problem is defined in Section 2 and error bounds are derived in Section 3. An error estimation algorithm is given
in Section 4, and in Section 5 a numerical example is presented to illustrate how well our new algorithm works.

2.  PROBLEM  DEFINITION 

Given an $n \times q$ complex data matrix $X(n)$, the least squares problem with linear equality constraints is to find
a $q$-element complex vector $w(n)$ such that

\[ \| X(n) w(n) \| = \min \tag{2.1a} \]

subject to the linear constraints

\[ S w(n) = b, \tag{2.1b} \]

where $S$ is a $k \times q$ ($k < q$) complex matrix and $b$ is a $k$-element complex vector. Throughout this paper, we use
the 2-norm:

\[ \| \cdot \| \equiv \| \cdot \|_2 . \]

In signal processing, new data arrives continuously. Define the data matrix $X(n)$ recursively by

\[ X(n) = \begin{pmatrix} X(n-1) \\ x(n)^T \end{pmatrix}, \]

i.e., the $n$th row $x(n)^T$ represents a snapshot at time $n$. Our goal is to compute the $n$-th residual element

\[ r_n = x(n)^T w(n). \tag{2.2} \]
Is the solution vector $w(n)$ unique? Define a $(k + n) \times q$ matrix $SX(n)$ by

\[ SX(n) = \begin{pmatrix} S \\ X(n) \end{pmatrix}. \]

We assume that $k + n > q$. The solution is unique if and only if the matrix $SX(n)$ has full column rank, that is,
the overdetermined matrix equation

\[ SX(n) \, w(n) = 0 \tag{2.3} \]

has the unique solution $w(n) = 0$.

Next, we wish to transform (2.1) into a familiar unconstrained problem; see [3] and [4]. Let

\[ p = q - k \]

and partition the matrix $S$ as

\[ S = ( S_1 \;\; S_2 ), \]

where $S_1$ is $k \times k$ and $S_2$ is $k \times p$. For simplicity, we assume that $S_1$ is nonsingular and upper triangular; for
example, $S_1$ may be the result of an initial QR decomposition of $S$. Accordingly, we also partition $X(n)$ as

\[ X(n) = ( X_1(n) \;\; X_2(n) ), \]

so that $X_1$ is $n \times k$ and $X_2$ is $n \times p$. Then (2.3) becomes

\[ \begin{pmatrix} S_1 & S_2 \\ X_1(n) & X_2(n) \end{pmatrix} w(n) = 0, \]

which is equivalent to

\[ \begin{pmatrix} S_1 & S_2 \\ 0 & C(n) \end{pmatrix} w(n) = 0, \]

where

\[ C(n) = X_2(n) - X_1(n) S_1^{-1} S_2 . \]

The matrix $C(n)$ is called the Schur complement of $S_1$ in $SX(n)$. The equation (2.3) has only the trivial solution if and
only if $C(n)$ has full column rank. We proceed to eliminate the constraints. Let

\[ w(n) = \begin{pmatrix} w_1(n) \\ w_2(n) \end{pmatrix}, \]

so that $w_1(n)$ is $k \times 1$ and $w_2(n)$ is $p \times 1$. Since

\[ S_1 w_1(n) + S_2 w_2(n) = b, \]

we get

\[ w_1(n) = S_1^{-1} b - S_1^{-1} S_2 w_2(n). \tag{2.4} \]

Let

\[ v(n) = -X_1(n) S_1^{-1} b. \]

We derive

\[ \| C(n) w_2(n) - v(n) \| = \min, \tag{2.5} \]

an unconstrained problem analyzed in [3], [4]. Now, what about the residual element $r_n$? Define the Schur
complement matrix $C(n)$ recursively by

\[ C(n) = \begin{pmatrix} C(n-1) \\ c(n)^T \end{pmatrix}. \]
Partition the row vector $x(n)^T$ so that

\[ x(n)^T = ( x_1(n)^T \;\; x_2(n)^T ), \]

where $x_1(n)^T$ is $1 \times k$ and $x_2(n)^T$ is $1 \times p$. We get

\[ c(n)^T = x_2(n)^T - x_1(n)^T S_1^{-1} S_2 . \]

Let $v_n$ denote the $n$-th element of $v(n)$. The last residual element of (2.5) is then

\[ c(n)^T w_2(n) - v_n = x_2(n)^T w_2(n) + x_1(n)^T w_1(n) + v_n - v_n = r_n, \]

i.e., the same residual element as desired by the constrained problem (2.1).
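
As a quick sanity check of this identity, the following minimal Python sketch works through the smallest case $k = p = 1$ with real data. The function name `residual_both_ways` and the numerical values are illustrative assumptions, not taken from the paper.

```python
def residual_both_ways(s1, s2, b, x1, x2, w2):
    """Compare x^T w with c^T w2 - v_n for k = p = 1 (real scalars).

    w2 is arbitrary: the identity holds for any w2 once w1 is chosen
    to satisfy the constraint s1*w1 + s2*w2 = b, as in (2.4).
    """
    w1 = (b - s2 * w2) / s1          # (2.4): constraint eliminated
    c = x2 - x1 * s2 / s1            # new row of the Schur complement
    v_n = -x1 * b / s1               # new element of v(n)
    constrained = x1 * w1 + x2 * w2  # r_n = x(n)^T w(n), as in (2.2)
    unconstrained = c * w2 - v_n     # last residual element of (2.5)
    return constrained, unconstrained


lhs, rhs = residual_both_ways(s1=2.0, s2=1.0, b=4.0, x1=3.0, x2=5.0, w2=0.5)
print(lhs, rhs)
```

Both expressions evaluate to the same number for any choice of `w2`, which is the point of the elimination: the constrained residual can be read off the unconstrained problem.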

How do we calculate $r_n$ recursively? Suppose that we have available a QR decomposition of the $(n - 1) \times p$
matrix $C(n - 1)$:

\[ C(n - 1) = Q(n - 1) R(n - 1), \]

where $Q(n - 1)$ is $(n - 1) \times p$ with orthonormal columns and the matrix $R(n - 1)$ is $p \times p$ upper triangular. The
problem (2.5) is reduced to

\[ \left\| \begin{pmatrix} R(n-1) \\ c(n)^T \end{pmatrix} w_2(n) - \begin{pmatrix} u(n-1) \\ v_n \end{pmatrix} \right\| = \min, \]

where $u(n - 1) = Q(n - 1)^H v(n - 1)$. We triangularize the coefficient matrix by a unitary matrix $P$. Then

\[ P \begin{pmatrix} R(n-1) & u(n-1) \\ c(n)^T & v_n \end{pmatrix} = \begin{pmatrix} R(n) & u(n) \\ 0^T & \gamma \end{pmatrix}, \]

so that $R(n)$ is $p \times p$ upper triangular. The matrix $P$ consists of $p$ Givens matrices. From $P$ and $Q(n - 1)$ we can
construct an $n \times p$ orthonormal matrix $Q(n)$ such that $C(n) = Q(n) R(n)$ and $u(n) = Q(n)^H v(n)$. The desired
element $r_n$ is given by

\[ r_n = -( c_1 \cdots c_p ) \, \gamma, \]

where $c_1, \ldots, c_p$ denote cosines of the $p$ rotations that make up $P$.
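
The recursive step can be illustrated with a short Python sketch for real data. It assumes the residual convention $r = C w_2 - v$ used above, under which the last residual element equals minus the product of the cosines times $\gamma$. The names `update_residual` and `back_solve` are hypothetical, and the sketch is sequential rather than systolic.

```python
import math


def update_residual(R, u, c, v_n):
    """Rotate the new row (c^T, v_n) into the triangular system (R, u).

    R is p x p upper triangular (list of rows), u has length p.
    Returns the updated R, the updated u, and the residual element r_n.
    Real data only; the inputs are copied, not modified.
    """
    p = len(u)
    R = [row[:] for row in R]
    u = u[:]
    z = c[:]             # working copy of the incoming row
    gamma = v_n          # working copy of the incoming right-hand side
    cos_product = 1.0
    for l in range(p):
        rho = math.hypot(R[l][l], z[l])
        if rho == 0.0:
            continue     # nothing to annihilate in this column
        cs, sn = R[l][l] / rho, z[l] / rho
        for j in range(l, p):
            R[l][j], z[j] = (cs * R[l][j] + sn * z[j],
                             -sn * R[l][j] + cs * z[j])
        u[l], gamma = cs * u[l] + sn * gamma, -sn * u[l] + cs * gamma
        cos_product *= cs
    return R, u, -cos_product * gamma


def back_solve(R, u):
    """Solve the upper-triangular system R w = u."""
    p = len(u)
    w = [0.0] * p
    for i in reversed(range(p)):
        w[i] = (u[i] - sum(R[i][j] * w[j] for j in range(i + 1, p))) / R[i][i]
    return w


R_new, u_new, r_n = update_residual([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0],
                                    [1.0, 1.0], 0.0)
w2 = back_solve(R_new, u_new)
```

In this small example the least squares solution is `w2 = [1/3, 1/3]`, so the last residual element is `1*(1/3) + 1*(1/3) - 0 = 2/3`, which matches the value of `r_n` obtained from the rotations without ever forming `w2`.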

3.  PERTURBATION  ANALYSIS 

Elden [1] presented a perturbation analysis of the linearly constrained least squares problem. Since his theory
is general, it involves weighted pseudoinverses and their corresponding condition numbers. In this section, we
derive simpler perturbation bounds for the solution $w(n)$ as well as for the residual element $r_n$. To simplify our
presentation, we will drop the argument $(n)$ for the matrices and vectors, and let $\kappa(M)$ denote the condition number
of a matrix $M$ with respect to the 2-norm.

Let $\tilde{w}$ solve the perturbed least squares problem

\[ \| ( X_1 + \epsilon E_{X_1} \;\; X_2 + \epsilon E_{X_2} ) \, \tilde{w} \| = \min \tag{3.1a} \]

subject to the perturbed linear equality constraints

\[ ( S_1 + \epsilon E_{S_1} \;\; S_2 + \epsilon E_{S_2} ) \, \tilde{w} = b + \epsilon f_b . \tag{3.1b} \]

Suppose that $t \ge 0$ is a real variable and let

\[ C + t E_C = ( X_2 + t E_{X_2} ) - ( X_1 + t E_{X_1} ) ( S_1 + t E_{S_1} )^{-1} ( S_2 + t E_{S_2} ) \]

and

\[ v + t f_v = -( X_1 + t E_{X_1} ) ( S_1 + t E_{S_1} )^{-1} ( b + t f_b ). \]


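Recall that $S_1$ is nonsingular and that $C$ has full column rank. Suppose $\epsilon$ is sufficiently small so that for $t \in [0, \epsilon]$
we have that $S_1 + t E_{S_1}$ is nonsingular and $C + t E_C$ has full column rank. Let $w(t)$ solve the matrix equation

\[ \begin{pmatrix} S_1 + t E_{S_1} & S_2 + t E_{S_2} \\ 0 & ( C + t E_C )^H ( C + t E_C ) \end{pmatrix} w(t) = \begin{pmatrix} b + t f_b \\ ( C + t E_C )^H ( v + t f_v ) \end{pmatrix}. \tag{3.2} \]

Then $w(0)$ and $w(\epsilon)$ are solutions to problems (2.1) and (3.1), respectively. Define $w \equiv w(0)$ and $\tilde{w} \equiv w(\epsilon)$. Then

\[ \tilde{w} = w(0) + \epsilon \, w'(0) + O(\epsilon^2). \]

Differentiate (3.2) with respect to $t$ and set $t = 0$. We get

\[ \begin{pmatrix} E_{S_1} & E_{S_2} \\ 0 & E_C^H C + C^H E_C \end{pmatrix} w(0) + \begin{pmatrix} S_1 & S_2 \\ 0 & C^H C \end{pmatrix} w'(0) = \begin{pmatrix} f_b \\ E_C^H v + C^H f_v \end{pmatrix}. \tag{3.3} \]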

Let 

?)•  62 (o'  i).  -0)  -  MO 


Then 


O1  C"c)  =^(CHC)-1  and  ||C||  <  ||<7||. 


Solving  for  tii{0)  in  (3.3),  we  obtain 

*•>-0  ci)"[(A)fe  /MM4MG  4M 

=  S~1(CHC)~lCH 


A-l'f  £)* 


where  r  =  Cwj  -  v  denotes  the  residual  vector.  Furthermore,  by  assuming 

IIAII  <11*11.  II/*  II  <  i|v|i,  ||£cll<l|C;i 

and 


Es, 

0 


<  Mil  ||C||, 


f  MIMM  Sc) 

we  derive  the  inequality 

|R0)H  <  MMl  \\{CHC)-lCH\\  [||d||  -f-  \\C\\  |(5||  ||W||]  +  IIS-1!)  ||(CwC)-l||  ||Cj|  ||r||. 


(3  3) 


(3.4) 

(3.5a) 

(3.5b) 




Consequently,  we  obtain  the  following  perturbation  result. 

Lemma. Using the notation defined above and assuming that $\epsilon$ in (3.1) is sufficiently small so that the inequalities
(3.5) are satisfied, we get


[|w  ~  ttij 
INI 


<.{«(5WC,(j, 


Ml 


<?||  Mil  IN 


+ 1  +  *(S)/c(C)3 


Mil  Mil  IMi  ■ 


+  0((2).  a 


(3-6) 


To illustrate the effect of $\kappa(S)$ on the solution of (2.1), consider a simple example in which $S = ( S_1 \;\; 0 )$ and
$X = ( I_n \;\; I_n )$, where $I_n$ is an $n \times n$ identity matrix and $n < k < 2n$. By observation, $w_1 = S_1^{-1} b$ and $w_2 = -w_1$.
Since $\kappa(S) = \kappa(S_1)$ in this example, we see why the presence of $\kappa(S)$ is necessary in (3.6).
We proceed to derive a bound for the error in the residual. Let

\[ \begin{pmatrix} 0 \\ r(t) \end{pmatrix} = \begin{pmatrix} S_1 + t E_{S_1} & S_2 + t E_{S_2} \\ 0 & C + t E_C \end{pmatrix} w(t) - \begin{pmatrix} b + t f_b \\ v + t f_v \end{pmatrix}, \]

differentiate the equation, and then set $t = 0$. Using (3.4) to substitute for $w'(0)$, we get


UMt 


=(/-«<) [(^  % )»-* 


-  C(CMC) 


H  1 


where $C^+ = ( C^H C )^{-1} C^H$. Consequently,

\[ r'(0) = ( I - C C^+ ) ( E_C w_2 - f_v ) - C ( C^H C )^{-1} E_C^H r . \]

As for the residual element we have $r_n = e_n^T r$, where $e_n \equiv (0, \ldots, 0, 1)^T$ denotes the $n$-th unit coordinate vector.
Using the assumptions (3.5a) and noticing that $w_2 = C^+ v$ and $r = -( I - C C^+ ) v$, we derive our major result.

Theorem.  Under  the  same  conditions  as  in  the  Lemma,  we  get 

<  «  [11/  -  CC'\\(2k(C)  4  1)]  +  0(e2) 

and 


rn  -  r„ 


<  e  [||/  -  CC'||(k(C)  4  lie’ll  ||C»e,||  4  1)]  4  0(c2).  o 


(3.7) 


(3.8) 


Here  are  some  additional  remarks.  If  we  set  S  —  I  and  f>  —  0,  then  (3.6)  leads  to  a  perturbation  bound  for  the 
standard  least  squares  problem  [2].  We  also  note  that  (ICSu^l2  4  ||vjj2  =  l|dj|2.  Thus,  we  can  define 


COS0  =  J|CSw||/||rf|| 


and use $(1 / \cos\theta)$ and $\tan\theta$ in (3.6). The bound (3.7) is similar to a result derived in [2]. The inequality (3.8)
indicates that $| \tilde{r}_n - r_n |$ depends on $\kappa(C)$ as well as on $\| v \|$. Both (3.7) and (3.8) can be simplified by using the
relation that $\| I - C C^+ \| = \min\{ 1, n - p \}$.


4.  ERROR  ESTIMATION 

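Although the error bound (3.8) is simple, it requires $C^+ e_n$, whose computation involves at least a back-solve.
In this section, we present an error estimation scheme for the desired residual element. When the new data vector
$x(n)^T$ arrives, it is first processed by $S$ so that $x_1(n)^T$ is annihilated. In particular, let

\[ ( z_1^{(0)} \; \cdots \; z_q^{(0)} ) = x(n)^T \quad \text{and} \quad u^{(0)} = 0. \]

Then the preprocessing proceeds as follows:

\[ \begin{pmatrix} s_{l,l} & s_{l,l+1} & \cdots & s_{l,q} & b_l \\ 0 & z_{l+1}^{(l)} & \cdots & z_q^{(l)} & u^{(l)} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -g_l & 1 \end{pmatrix} \begin{pmatrix} s_{l,l} & s_{l,l+1} & \cdots & s_{l,q} & b_l \\ z_l^{(l-1)} & z_{l+1}^{(l-1)} & \cdots & z_q^{(l-1)} & u^{(l-1)} \end{pmatrix} \]

for $l = 1, 2, \ldots, k$, where $g_l = z_l^{(l-1)} / s_{l,l}$. Writing in algorithmic form, we have

    for l = 1, 2, ..., k
        g_l = z_l^{(l-1)} / s_{l,l};
        for j = l+1, ..., q
            z_j^{(l)} = z_j^{(l-1)} - g_l s_{l,j};
        end;
        u^{(l)} = u^{(l-1)} - g_l b_l
    end.

The above process shows that

\[ ( z_{k+1}^{(k)} \; \cdots \; z_q^{(k)} ) = x_2(n)^T - x_1(n)^T S_1^{-1} S_2 \quad \text{and} \quad u^{(k)} = -x_1(n)^T S_1^{-1} b. \]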
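
The preprocessing step is ordinary row elimination against the rows of $(S \mid b)$. A minimal Python sketch for real data follows; the function name `preprocess_row` and the sample numbers are assumptions made for illustration, and the loop is sequential rather than systolic.

```python
def preprocess_row(S, b, x):
    """Annihilate the first k entries of x using the rows of (S | b).

    S is k x q with a nonsingular, upper-triangular leading k x k block,
    b has length k and x has length q. Real data only.
    Returns (c, v_n): the last q - k entries of the reduced row and the
    accumulated right-hand side u^(k).
    """
    k, q = len(S), len(x)
    z = list(x)
    u = 0.0
    for l in range(k):
        g = z[l] / S[l][l]            # multiplier g_l
        for j in range(l, q):
            z[j] -= g * S[l][j]       # z_l becomes zero, later entries update
        u -= g * b[l]
    return z[k:], u


# k = 2, q = 3: S1 = [[2, 1], [0, 1]], S2 = [4, 2]^T
c, v_n = preprocess_row([[2.0, 1.0, 4.0], [0.0, 1.0, 2.0]],
                        [2.0, 3.0],
                        [4.0, 3.0, 5.0])
```

For these numbers, $x_1^T S_1^{-1} = (2 \;\; 1)$, so the reduced row is $5 - (2 \cdot 4 + 1 \cdot 2) = -5$ and the right-hand side is $-(2 \cdot 2 + 1 \cdot 3) = -7$, in agreement with the two closed-form expressions above.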

These two variables are then used for updating the QR decomposition of $C(n - 1)$ and computing the residual
element. We present below the algorithm derived in [4].

for  1  =  1,  2, . .  .,p 


n  (n  —  1)  /  (») 

cos  0\  -  c) ,  7c)  ,  ; 


.  (fc+r—  1)  ,  (n) 

sinfl,  =zi+,  7c), ii 

for  j  =  f  + 


c|?  =  c{"  1 }  cos  01  +  1 )  sin  &i ; 

4+^*  =  ~cj"'l)sin0j  +4‘4'-,)c°s<l, 
end; 

vjn^  =  ti)n~l)cos0j  +  v(i+l~l)sin0/ ; 
u(*+0  =  —  sin  0t  +  u(k+,~1f  cos0i 


rn  =  U(*+P)  cos  0, . 

In the above, $c_{i,j}^{(k)}$ (for $k = n - 1, n$) denotes the $(i, j)$-element of $C(k)$ and $v_i^{(k)}$ the $i$-th element of $v(k)$.
Now, we discuss an error estimation scheme for the preprocessing. Let a tilde denote the corresponding computed
value and $fl$ the floating point computation. In the above procedure we calculate

9i  = 

i(l }  =  fl( 

Define  the  relations  between  the  exact  and  computed  quantities  as  follows; 

01  =  ?i(l  +  «(&(<)), 

i")  =  »«*)(1  +  J)(,)fl(,)(e))l 

6;  =  6,(1  +  fll6i(f)), 

where  [&,y(e)|  =  0(e),  |V>]')(e)|  =  0(e),  |£,(e)|  =  O(e),  |0f,)(c)|  =  O(e)  and  (<i(e)|  =  O(e).  The  five  quantities 
(j‘\  <*ii  nnd  m  are  all  real  and  nonnegative.  We  also  assume  that  the  errors  such  as  and  (j^vj^(e) 

are  so  small  that  higher  order  terms  like  (<r[,i^i,i(e))J  and  (<r,1i0i,,(«))((,(i-1  V|,-1){<))  can  be  ignored.  Using  the 
lemma  in  [3],  we  obtain  the  following  algorithm  for  estimating  the  errors  in  preprocessing. 


for  /  =  1,2 ,...,£ 


begin 


<*i  =  maJC{Cj!  l  \  ffij }  ; 
for  j  =  l  +  1, . . . ,  q 


v 


end. 


As explained in [3], the above estimation scheme can be incorporated with the preprocessing procedure and
implemented on the same systolic architecture. Additional time is minimal because the calculations can be carried out
during the otherwise idle time of the processors.

The  error  estimate  for  r„  can  be  obtained  by  the  algorithm  presented  in  [3]  using  (Ck+i  ’ ' '  C«k))  an<^  ns 
the  error  estimates  for  (zj;^  •••  Zq**)  and  respectively.  Again,  we  list  the  error  estimation  algorithm  and 
refer  the  details  to  [3].  Define  the  relations  between  the  exact  and  computed  quantities  as  follows: 


and 


^  ~  c»!>^ 1  +  A())< 

cos0,  =  cos  0,(1  + 

i<*>  =  «(*)(1  +  ,(*¥*)(€)), 

sin  0,  =  sin  0,(1  +  <rlfi^|‘,,(c))l 

f,(n’1)  =  v{,n~l)(l  +  (..p+i“.,p+i(<)). 

i[’n)  =  u,(T,)(l  +  <r,iP+i  d»,.p+i(« 

fn  =  rn(l  +  q0(e)). 
The following algorithm estimates the error in the last element of the residual vector:


for  l  =  1,2, . .  .  ,p 
begin 
<rM  = 

for  j  =  1+  l,...,p 
begin. 

Ici”-*’  co*  9i  mu{(>  JtOi.i}l  +  |*i‘ J1**’  *inS)  '*  ,o(.i  }| 

=  - |c!r»c0,S,^-»„n8|, -  -  ! 

.(i+k)  _  le!T1>  *in9<  eo>8i 

*+J  |«‘7*»  co*S|| 

end; 


ft.p+i  — 


_  co*  ffj  mu((i  ,+i,<ri  iH-t-lu^*1-11  *in6,  ,oi  i}l 


WT^CO.S.+U.*— - sinS|| 


(k+l)  _  I*’,1"-11  »»nSi  maxffi  ,+  i,<ri  i}l  +  [«(t'H~‘)  co*8|  11  ,g|  ,}| 

^  |*ij  11  *in ffi -u<*l+*~,l  eo*0i| 

end; 

V  =  7j(*+J0  max{<rlill  •  •  • ,  <rPiP}  . 


5.  AN  EXAMPLE 

The  example  in  this  section  shows  that  the  computed  residual  element  may  be  accurate  even  when  the  matrix 
C  is  ill-conditioned.  In  this  case,  the  proposed  scheme  gives  a  better  error  estimate  than  (3.8).  Both  the  MS 
algorithm  and  the  error  estimation  algorithm  were  implemented  using  MATLAB  and  run  on  a  VAX  8300  with 
machine precision $\epsilon = 1.1102 \times 10^{-16}$ in the Communications Research Laboratory at McMaster University.

Example.  Suppose  the  exact  constraint  matrix  and  corresponding  right  side  vector  are 


/ 103 

0 

0 

1 

0  0\ 

/-120000-y/2/7\ 

S  =  0 

10~3  0 

0 

1  o) 

and  5=  n/10/700  j 

V  0 

0 

1 

0 

0  1/ 

\  6v/5/7  / 

Thus  we  set  the  error  estimates  as  <rtiJ  =  m  -  l  and  =  6|( f)  =  (■  The  data  matrix  at  time  n  -  1  is 

/-I  -v/5  -2%/TO  0  0  0\ 

X(n  -  1)  =  0  -1  v/2  000  . 

\0  0  -1  000/ 

Suppose  we  know  the  exact  R(n  -  1)  and  u(n  -  1): 


/  0.001 

1000V^ 

2-/I0  \ 

/  —  10v/2/7  \ 

o 

II 

1 

e 

05 

1000 

-2V2 

and  u(n  —  1)  =  I  4n/10/7  ) 

V  o 

0 

1  ) 

\  6V5/7  ) 

Similarly,  the  error  estimates  of  their  elements  are  all  initialized  as  e.  Now  the  new  data 

x(n)T  =  (-l  -v/5  -2v/l0  0.001  0  0)  and  u(0)  =  0 
are  available  and  their  error  estimates  are  initialized  as  <;(0)  =  1,  for  j  =  1, .  . 6,  and  r7(0)  =  1,  respectively.  After
preprocessing,  we  get  c(n)T  =  (0.002  lOOOVo  2^/lQ)  and  vn  =  ~10y/2/7.  The  corresponding  error  estimates 
are  (j3^  =  1,  for  j  =  4,5,6,  and  r/3>  =  23.  The  QR  updating  scheme  and  its  error  estimation  algorithm  are  then 
applied  to  R(n  —  1),  u(n  —  1),  c(n),  vn  and  their  error  estimates.  The  exact  residual  element  r„  =  6\/2/35.  The 
computed  error  is 

\[ | \tilde{r}_n - r_n | = 1.11 \times 10^{-16}. \]

The  condition  number  of  C(n)  is  4.6  x  10®  and  the  error  bound  as  given  by  (3.8)  equals  3.40  x  10~9.  The  estimation 
algorithm gives a much more accurate value of $9.62 \times 10^{-16}$.

ACKNOWLEDGEMENTS

The work of F. T. Luk was supported in part by the Rensselaer Polytechnic Institute, and by the Army Research
Office under grant DAAL03-90-G-0104 and the Joint Services Electronics Program under contract F49620-90-C-0039
at Cornell University. The work of S. Qiao was supported in part by the Natural Sciences and Engineering
Research Council of Canada under grant OGP0046301.

REFERENCES 

[1] L. Elden, "Perturbation theory for the least squares problem with linear equality constraints," SIAM J. Numer.
Anal., Vol. 17, No. 3, June 1980, pp. 338-350.

[2] G. H. Golub and C. F. Van Loan, Matrix Computations, Second Edition, The Johns Hopkins University Press,
Baltimore, Maryland, 1989.

[3] F. T. Luk and S. Qiao, "Analysis of a recursive least-squares signal-processing algorithm," SIAM J. Sci. Stat.
Comput., Vol. 10, No. 3, May 1989, pp. 407-418.

[4] J. G. McWhirter, "Recursive least-squares minimization using a systolic array," Real Time Signal Processing
VI, K. Bromley, Ed., Proc. SPIE 431, 1983, pp. 105-112.

[5] J. G. McWhirter and T. J. Shepherd, "A systolic array for linearly constrained least-squares problems,"
Advanced Algorithms and Architectures for Signal Processing, J. M. Speiser, Ed., Proc. SPIE 696, 1987, pp. 80-87.


218 / SPIE Vol. 1770 (1992)


RPI 


Department  of  Computer  Science 

Technical  Report 


COMPUTING  THE  SINGULAR  VALUE  DECOMPOSITION 
ON  A  FAT-TREE  ARCHITECTURE 


TONG  J.  LEE 

School  of  Electrical  Engineering 
Cornell  University 
Ithaca,  New  York  14853,  USA 


FRANKLIN  T.  LUK 
Department  of  Computer  Science 
Rensselaer  Polytechnic  Institute 
Troy,  New  York  12180,  USA 


DANIEL  L.  BOLEY 
Department  of  Computer  Science 
University  of  Minnesota 
Minneapolis,  Minnesota  55455,  USA 


Rensselaer  Polytechnic  Institute 
Troy,  New  York  12180-3590 


Report  No. 


92-33 


November  1992 




COMPUTING  THE  SINGULAR  VALUE  DECOMPOSITION 
ON  A  FAT-TREE  ARCHITECTURE 

TONG J. LEE

School of Electrical Engineering
Cornell University
Ithaca, New York 14853, USA
tjlee@ee.cornell.edu

FRANKLIN T. LUK
Department of Computer Science
Rensselaer Polytechnic Institute
Troy, New York 12180, USA
luk@cs.rpi.edu

DANIEL L. BOLEY
Department of Computer Science
University of Minnesota
Minneapolis, Minnesota 55455, USA
boley@cs.umn.edu


ABSTRACT. The Singular Value Decomposition (SVD) is a matrix tool that plays a critical
role in many applications; for example, in signal processing, it is often necessary to calculate
the SVD in real time. We present here a new technique for computing the SVD on a
parallel architecture whose processors are connected via a fat-tree. We tested our idea on
the Connection Machine CM-5, and achieved efficiency up to 40% even for moderately sized
matrices.

KEYWORDS. Singular value decomposition, parallel Jacobi algorithm, fat-tree, CM-5.
1. Introduction

Let $A$ be a real $m \times n$ matrix. Its singular value decomposition (SVD) is given by

\[ A = U \Sigma V^T, \]

where $U$ and $V$ are respectively $m \times m$ and $n \times n$ orthogonal matrices and $\Sigma$ is an $m \times n$
diagonal matrix. The best approach to parallel SVD computation is apparently one of the
Jacobi type: see, e.g., [1], [2], [4], [5], [7], [11], [12]. In this paper, we will discuss the efficient
implementation of a Jacobi method on a parallel computer with a fat-tree interconnection
network. We will propose a new Jacobi ordering for a fat-tree and analyze its behavior
both theoretically and experimentally (on a Connection Machine CM-5).

This paper is organized as follows. In the next two subsections, we present the fat-tree
architecture and Jacobi algorithm. Section 2 introduces a new fat-tree ordering, and
provides some kernel programs. We analyze communication costs on a fat-tree network in
Section 3, and discuss implementation results on the CM-5 in Section 4.

1.1.  Fat-Tree  Architecture 

The  fat-tree  was  introduced  by  Leiserson  [10]  as  a  novel  approach  to  interconnect  the 
processors  of  a  general-purpose  parallel  supercomputer.  This  communication  structure  can 
also  be  seen  in  the  distributed  computing  environment,  such  as  a  network  of  workstations. 

The routing network of the Connection Machine CM-5 [14] is based on the fat-tree. This
parallel machine consists of up to 544 (= 512 + 32) nodes for the model at the Army High
Performance Computing Research Center (AHPCRC) at the University of Minnesota, and
32 nodes at the Northeast Parallel Architectures Center (NPAC) at Syracuse University.
Each node of the CM-5 is a SPARC chip which runs at 32 MHz and delivers 22 Mips and
5 Mflops. There is a 64 Kbyte instruction and data cache and a 16 Mbyte memory in
each node. All the nodes are synchronized. In October of 1992, two vector units will be
installed in each processing node; each vector unit is capable of 64 Mflops peak and 40
Mflops sustained [9]. The control and data networks are connected via a skinny fat-tree
structure. By skinny, we mean that the bandwidth does not increase proportionately to
the number of nodes: in particular, the bandwidth is 20 Mbyte/sec per node in a group of
four processors, 10 Mbyte/sec per node in a group of sixteen, and 5 Mbyte/sec overall. So
data contention may severely degrade performance when all nodes need to access a large
set of data from other nodes through the top level of the tree.

1.2.  Jacobi  Algorithm 

The one-sided Jacobi method [8] generates an orthogonal matrix $V$ such that the columns
of the matrix $W$, given by $W = AV$, are mutually orthogonal. The matrix $V$ can be
generated by a sequence of plane rotations, where each rotation is an identity
matrix except for the four entries in rows and columns $i$ and $j$, which equal $\cos\theta$, $-\sin\theta$, $\sin\theta$ and $\cos\theta$,
where $(i, j)$ represents the index pair of the columns of $A$ that the rotation orthogonalizes. The
SVD computation requires $O(mn^2)$ operations for an $m \times n$ matrix $A$. For a limited
number of processors, i.e., up to $n/2$ processors, an efficient way is to configure them as a
linear array along the horizontal dimension. Columns can be distributed either in blocks
or in a wraparound fashion. Note from the above derivation that each column pair can be
orthogonalized independently, so that we may transform up to $p$ pairs concurrently, where $p$
denotes the number of processors. This method was used for computing the SVD on special
machines, e.g., parallel computers such as the Illiac IV [11] and vector processors such as
the CYBER 205 [3]. The one-sided Jacobi method is composed of these major steps.

1. Compute the norm of each column.

2. Compute plane rotations to orthogonalize paired columns.

3. Apply the plane rotations to update the columns and the column norms.

4. Permute the columns in a pre-chosen order to generate the next column pairs, and
repeat the process from step 2.
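
The four steps can be sketched in a few lines of Python. This is a sequential illustration under stated assumptions, not the parallel implementation discussed in this paper: it visits the column pairs in a plain cyclic order instead of a parallel ordering, recomputes the column norms for every pair instead of updating them, and handles real matrices only. The function name `one_sided_jacobi` and its parameters are hypothetical.

```python
import math


def one_sided_jacobi(A, tol=1e-12, max_sweeps=30):
    """Return the singular values of A (list of rows) in descending order.

    Columns are orthogonalized pairwise by plane rotations until a full
    sweep applies no rotation; the singular values are the column norms.
    """
    m, n = len(A), len(A[0])
    cols = [[A[r][c] for r in range(m)] for c in range(n)]  # column copies
    for _ in range(max_sweeps):
        rotated = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = sum(x * x for x in cols[i])
                beta = sum(x * x for x in cols[j])
                gamma = sum(x * y for x, y in zip(cols[i], cols[j]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue                  # pair is already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for r in range(m):
                    ai, aj = cols[i][r], cols[j][r]
                    cols[i][r] = c * ai - s * aj
                    cols[j][r] = s * ai + c * aj
        if not rotated:
            break
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols),
                  reverse=True)


sigma = one_sided_jacobi([[1.0, 1.0], [0.0, 1.0]])
```

For the 2 x 2 example, the singular values are the square roots of the eigenvalues $(3 \pm \sqrt{5})/2$ of $A^T A$, i.e. approximately 1.618 and 0.618.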

If the column pairs are distributed to different processors, then step 4 requires communication.
In the case of a two-dimensional mesh (as in the ILLIAC IV), each column is itself
distributed among different processors and step 3 requires that the rotation parameters
be transmitted to all the processors containing each given column pair. In the case of a
one-dimensional array, each column pair is stored entirely in one processor and significant
speedup is possible if vector units are present within each processor.

In this paper, we use the one-dimensional array, with each processor storing two blocks
of columns. That is, we use a block Jacobi algorithm, in which the column blocks are
circulated  according  to  a  given  ordering  to  be  defined,  and  the  <\r!ich\  -row  -  ordering  ;t»j 
is  used  within  each  block. 

2.  New  SVD  Algorithm 

In the past, when the hypercube interconnection topology was in vogue, several Jacobi
ordering schemes were proposed [1], [4], [7] to utilize the hypercube structure. Here, for
a one-dimensional array of processors with no wraparound, a chess-tournament ordering
[2] may be chosen because it does not waste processing power or memory space. However,
communication requires a two-way transmission of columns between adjacent processors.
An alternative is a ring ordering [4] which uses only one-way transmission, but it requires
a wraparound connection. To develop an ideal ordering for a fat-tree, we aim to minimize
the total path length by using the extra bandwidth of a fat-tree.

2.1.  Fat-Tree  Ordering 

It is easiest to describe this ordering by an example. In Figure 1 we show the case for
sixteen columns and eight processors. For pedagogic reasons, we use a base 8 numbering
of the indices and so A=8, B=9, ..., H=15. The XOR (exclusive-or) column is the binary
XOR of the column indices: at each step, the XOR value of each index pair is the same,
and from one step to the next this quantity follows the Gray code. The cost-to-this-step
column denotes the maximum number of levels up the tree the messages must travel to
reach their destinations from the previous step. In general, if there are $p$ processors and
two columns per processor, then a sweep requires $2p - 2$ steps. We save one step per sweep
because the last step of sweep $i$ can be included as the first step for sweep $i + 1$.
[Figure 1: table listing, for each step of a sweep, the ordering of index pairs, the XOR value, and the cost to this step.]

Figure 1. Fat-tree ordering based on the Gray code
(eight processors and sixteen columns).
2.2. Kernel Programs

To  see  how  to  write  a  simple  node  program  to  generate  the  fat-tree  ordering,  we  use  the 
following  observations  from  the  example  in  Figure  1.  To  simplify  the  presentation,  we 
consider  only  the  forward  sweep.  At  each  step,  each  processor  must  communicate  with  a 
remote  processor  whose  label  differs  in  one  bit.  The  basis  for  our  kernel  presented  here  is 
to  compute  a  mask  such  that  the  exclusive-or  of  the  mask  with  the  current  processor  label 
yields  the  remote  processor  label.  When  using  the  Gray  code,  this  mask  can  be  computed 
using  only  the  step  number  -  it  is  independent  of  the  processor  label. 
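
The property that makes a step-only mask possible is that two consecutive Gray codes differ in exactly one bit. The short Python sketch below illustrates this; the function names are hypothetical, the step index starts at 0 purely for illustration and is not the step numbering of Figure 1, and `levels_up` assumes the processors are the leaves of a binary tree numbered left to right from 0.

```python
def gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_step_mask(step):
    """XOR of two consecutive Gray codes: always a single set bit."""
    return gray(step) ^ gray(step + 1)


def levels_up(mask):
    """Levels a message climbs between leaves ProcNo and ProcNo ^ mask."""
    return mask.bit_length()


masks = [gray_step_mask(s) for s in range(7)]
costs = [levels_up(m) for m in masks]
```

Here `masks` is `[1, 2, 1, 4, 1, 2, 1]` and `costs` is `[1, 2, 1, 3, 1, 2, 1]`: most exchanges stay near the leaves, and only one of the seven reaches the top of an eight-leaf tree.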

We also use the following observations. First, we use the fact that the XORs follow the
Gray code. Second, we observe that during the second half of the forward sweep (steps
7-14), the lower half of the columns (numbers 0, ..., 7 in Figure 1) remain fixed in the
processor with the same number. Hence the location of the remaining columns is fixed
entirely by the Gray code. Third, we observe that the first half of the steps (steps 0-6)
amount to doing a Gray code fat-tree ordering on each half of the processor array separately.
The only remaining step is the transition from the first half to the second half (step 6 to
step 7). Hence we can define the ordering for these steps recursively from the smaller cases.

We  can  summarize  the  steps  for  the  forward  sweep  in  the  following  procedure,  in  a 
pseudo-MATLAB  notation  assuming  for  the  sake  of  simplicity  of  the  presentation  that  the 
sends  do  not  block. 

% Node program for processor ProcNo for one forward sweep using an array of
% NProcs processors. Assume Column(1) and Column(2) are the head and tail
% columns, respectively, in the local memory.

Orthogonalize_Individual_Column_Blocks;   % (within each block)

for StepNo = 1:2*NProcs-2,

    Pairwise_Orthogonalize_Column_Blocks;

    %% for each processor, figure where the data goes to and send it.

    [Mask,ColumnSwitch] = MakeMask(StepNo,ProcNo,NProcs);

    RemoteProcNo = XOR(ProcNo,Mask);

    Send Column(2) to remote processor RemoteProcNo;
    if ColumnSwitch == rotate,

        Column(2) = Column(1);

        Column(1) = receive_from(RemoteProcNo);
    else

        Column(2) = receive_from(RemoteProcNo);
    end;

end;

function [Mask,ColumnSwitch] = MakeMask(StepNo,ProcNo,NoProcs);

% ColumnSwitch indicates which column of the pair is to be sent/received.
% Mask is the XOR Mask so that RemoteProcNo = XOR(ProcNo,Mask).
% The Mask is computed independent of the processor label ProcNo.

% Handle first 2 steps as special cases to start recursion
if StepNo <= 2,
    Mask = 1;
    ColumnSwitch = tail;
    if rem(ProcNo,2) == 1 & StepNo == 1, ColumnSwitch = rotate; end;

% First half of sweep: pretend this is a separate fat tree sweep on each
% half of the processor array,
elseif StepNo < NoProcs-1,
    [Mask,ColumnSwitch] = MakeMask(StepNo,rem(ProcNo,NoProcs/2),NoProcs/2);

% Middle of sweep: here is first exchange through top of tree,
elseif StepNo == NoProcs-1,
    Mask = NoProcs/2;
    ColumnSwitch = tail;
    if ProcNo >= NoProcs/2, ColumnSwitch = rotate; end;

% Last half of sweep: only tail columns move, figure Mask using Gray codes,
elseif StepNo > NoProcs-1,
    Mask = xor(gray(StepNo),gray(StepNo+1));
    ColumnSwitch = tail;

end;
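
The mask recursion can be checked with a short Python transcription. This is only a
sketch under stated assumptions: the names gray and make_mask are ours, gray is taken to
be the standard binary-reflected Gray code, and processors are assumed to be numbered
from 0.

```python
def gray(k):
    """Binary-reflected Gray code of k (assumed meaning of gray())."""
    return k ^ (k >> 1)


def make_mask(step_no, proc_no, n_procs):
    """Return (mask, column_switch) for one step of the forward sweep."""
    if step_no <= 2:                       # first two steps start the recursion
        rotate = proc_no % 2 == 1 and step_no == 1
        return 1, "rotate" if rotate else "tail"
    half = n_procs // 2
    if step_no < n_procs - 1:              # first half: sweep on each half array
        return make_mask(step_no, proc_no % half, half)
    if step_no == n_procs - 1:             # middle: first exchange through the top
        return half, "rotate" if proc_no >= half else "tail"
    # last half: only tail columns move, mask from consecutive Gray codes
    return gray(step_no) ^ gray(step_no + 1), "tail"


# XOR masks seen by processor 0 of 8 over the 2*8 - 2 = 14 steps
masks = [make_mask(step, 0, 8)[0] for step in range(1, 15)]
print(masks)
```

Under these assumptions the mask 4 (an exchange through the top of the tree) occurs in
two of the fourteen steps and a mask of 2 or more in six of them.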

2.3.  Test  of  Convergence 

For a fat-tree ordering, any consecutive 2p-2 (or even 2p-1) steps may not constitute one
sweep. We must complete a sweep, either forward or backward, to ensure that all column
pairs have been orthogonalized. The convergence test is simple. We maintain a one-bit
counter in every processor. The counter is reset at the beginning of every sweep, and is set
whenever a column pair needs to be orthogonalized. At the end of the sweep, a global OR
operation is performed and convergence is achieved if no bit has been set.
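
As a minimal illustration of this test, the per-processor bits can be modelled as a
Python list and the global OR as a reduction over it. The function name sweep_converged
is hypothetical, and a real implementation would use the machine's global reduction
rather than a local loop.

```python
def sweep_converged(flags):
    """Global OR over the one-bit flags; converged if no bit was set."""
    any_set = False
    for bit in flags:
        any_set = any_set or bool(bit)
    return not any_set


# flags[i] is 1 if processor i had to orthogonalize some column pair
print(sweep_converged([0, 0, 0, 0]))
print(sweep_converged([0, 1, 0, 0]))
```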

3.  Analysis on a Binary Fat-Tree Network

We consider a binary fat-tree with p processors, and assume that the communication time
from one processor to another is determined by the number of links a message has to
traverse and the capacity of these links. Our assumption is supported by experimental
results reported in [13]. Define a channel to be the communication link between any two
adjacent nodes; here a node can be a processor or an internal switching element. The
capacity of a channel equals the number of parallel wires in the channel, and thus the
maximum number of simultaneous bit-serial messages it can support [10]. Label the levels
from the bottom up as level 1, 2, ..., so that the capacity of the channels at level $l$ is
$2^{l-1}$ times the capacity of the channels at the bottom level. Let us ignore
start-up and latency costs. Within a single problem, all the messages have the same size
and thus we measure the cost of multiple message transmission using path length.

For the ring ordering, at each step a message always goes through the top level and the
maximum path length equals $2 \log p$ (unless otherwise stated, we use base 2 logarithms).
Since there is at most one message at each channel, congestion never occurs and it takes
$2p - 1$ steps to complete one sweep. The total path length equals $(4p - 2) \log p$.

The fat-tree ordering does not cause congestion on a fat-tree network. Hence it suffices
to count the number of times that each level is used. Denote that count by $c(p, l)$. Consider
the forward sweep. We see from Figure 1 that with $p = 8$ processors, the top level is used
in two transition steps, the middle level in six steps and the bottom in fourteen steps. The
first six steps correspond to the fat-tree ordering for the first four processors, and also for
the second four processors. In the general case of $p$ processors, there are $2p - 2$ steps using
$\log p$ levels, of which the first $p - 2$ steps amount to the ordering for $p/2$ processors. When
the number of processors doubles to $2p$, we add a new top level and the first $2p - 2$ steps
correspond exactly to the $p$ processor ordering. There are an extra $2p$ steps, of which two
use the new top level, four use the next level (the old top level), eight use the following
level, etc. Formally, we get the recurrence

$$c(2p, l) = c(p, l) + 4\,(p/2^{l}) \quad \text{for } l = 1, \ldots, \log p,$$

starting with $c(p, \log p) = 2$ and $c(p, l) = 0$ for $l > \log p$. Therefore, $c(p, l) = 4p/2^{l} - 2$, and
the total path length is given by

$$2 \sum_{l=1}^{\log p} c(p, l) = 2\,[(2p - 2) + (p - 2) + \cdots + 14 + 6 + 2] = 8p - 4 \log p - 8.$$

For a large $p$, the path length ratio of the two orderings grows like $(\log p)/2$, a very attractive
result for our new ordering.
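
The recurrence and the two path-length formulas are easy to check numerically. The
following Python sketch (function names are ours) builds the level counts by repeated
doubling from p = 2, assuming p is a power of two, and compares the result with the ring
ordering.

```python
def level_counts(p):
    """c(p, l) for l = 1..log2(p), built by repeated doubling from p = 2."""
    counts = {1: 2}                 # p = 2: a single level, used in both steps
    q = 2
    while q < p:
        top = q.bit_length() - 1    # log2(q), the current top level
        doubled = {l: counts[l] + 4 * q // 2 ** l for l in range(1, top + 1)}
        doubled[top + 1] = 2        # the new top level is used twice
        counts, q = doubled, 2 * q
    return counts


def fat_tree_path_length(p):
    """Twice the sum of the level counts."""
    return 2 * sum(level_counts(p).values())


def ring_path_length(p):
    """(4p - 2) log p for the ring ordering."""
    return (4 * p - 2) * (p.bit_length() - 1)


for p in (8, 32, 1024):
    print(p, fat_tree_path_length(p), ring_path_length(p))
```

For p = 8 the counts are 14, 6 and 2 for levels 1, 2 and 3, matching the step counts read
off Figure 1.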

4.  Connection Machine CM-5

Although the CM-5 network is a 4-way tree, the analysis on 2-way trees is applicable. We
take a 4-way tree and expand every interior node into a binary tree consisting of that node
with two new children, each connected to two of the four former children. The number of
levels as well as the path length are doubled. However, the CM-5 is skinny and the capacity
only doubles at every level. Hence it becomes a skinny 2-way tree in which the capacity
goes up by $\sqrt{2}$ at each level.

To simplify our analysis, we concentrate on the 32-processor model. So $p = 32$ and there
are three tree levels because $\lceil \log_4 p \rceil = 3$. The dominating communication cost for the CM-5
is the overhead time that is spent on address calculation, buffer space management, and so
on. Let $t_{or}$ and $t_{of}$ represent the cost of such overhead in each step for the ring and the
fat-tree ordering, respectively. Let $t_{cf}$ be the overhead cost for resolving contention in the
channels of the CM-5 network when applying the fat-tree ordering, and let $t_e$ be the time
for traversing an edge in the network. We note that $t_e < t_{cf} < t_{oh}$, where $t_{oh} \in \{t_{or}, t_{of}\}$,
$t_{cf} \approx t_{oh}$, and $t_e \in (t_{oh}/10^3,\ t_{oh}/10^2)$. The overheads $t_{or}$ and $t_{of}$ depend on the data size
and are of equal magnitude.

We proceed to compute the coefficient for $t_e$, which we assume to equal the number
of messages that traverse the channels in one sweep. For the ring ordering, there is no
congestion in the network. So the coefficient for $t_e$ is $2 \cdot 63 \cdot 3$ (= 378), and the total cost
equals $63\,t_{or} + 378\,t_e$. For the fat-tree ordering, we observe that level 1 is visited 62 times,
level 2 fourteen times, and level 3 two times. We model the resolution of the contention
by sending messages in batches. Messages through level 2 must be sent in two batches and
messages through level 3 in four batches, in order to avoid contention. Hence we account
for the thinness of the CM-5 network by assigning a weight of two to level 2 and a weight
of four to level 3. The total path length is $2(62 + 2 \cdot 14 + 4 \cdot 2) = 196$ and the total cost
equals $62\,t_{of} + 196\,t_e + t_{cf}$. Thus, on the CM-5 the fat-tree ordering may not outperform
the ring ordering because of the extra cost associated with message contention.
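
The two edge-traversal coefficients can be reproduced with a few lines of Python. This is
a sketch with our own function names. Taking the visit counts from the binary-tree
formula c(p, l) = 4p/2^l - 2 at binary levels 1, 3 and 5 is our own reading of the 4-way
to 2-way expansion, not something stated in the text.

```python
def ring_edge_coefficient(steps=63, tree_levels=3):
    """Each ring step sends a message up to the top level and back down."""
    return 2 * tree_levels * steps


def fat_tree_edge_coefficient(visits, weights):
    """Weighted path length: 2 * sum(visits[k] * weights[k])."""
    return 2 * sum(v * w for v, w in zip(visits, weights))


def c(p, l):
    """Level-use count of the binary fat-tree analysis."""
    return 4 * p // 2 ** l - 2


# Assumption: 4-way levels 1, 2, 3 correspond to binary levels 1, 3, 5.
visits = [c(32, l) for l in (1, 3, 5)]
weights = [1, 2, 4]                # batches needed at 4-way levels 1, 2, 3

print(visits)
print(ring_edge_coefficient(), fat_tree_edge_coefficient(visits, weights))
```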

4.1.  Experimental  Results 

In Table 1 we present implementation results on a 32-node CM-5 for random $n \times n$ matrices
with $n$ ranging from 64 to 1024. The program was written in Fortran and each experiment
repeated ten times. We measured the overall and computation (by disabling communication)
costs for one sweep, and estimated the communication cost by subtracting the latter
from the former. Our results show that, despite the message congestion that it causes on
the CM-5, the fat-tree ordering gets more competitive as $n$ grows, justifying our effort to
minimize the total message path length (see also [13]). The mflops (million floating-point
operations per second) figures in Table 2 are computed based on the count that $8n^3$ flops
are required for one sweep. We conjecture that the compute performance deteriorates when
$n$ gets beyond 512 because the cache is no longer large enough to hold the huge column
blocks. Nonetheless, our implementation results show how, as the message size increases
(hence $t_e$ increases [13]), the fat-tree ordering quickly becomes competitive.

                           n = 64      n = 128     n = 256     n = 512     n = 1024

Overall      Ring          7.595e-2    3.229e-1    2.628       1.794e1     1.380e2
             Fat-tree      8.134e-2    3.481e-1    2.237       1.795e1     1.361e2

Compute      Ring          3.013e-2    2.320e-1    1.871       1.493e1     1.309e2
             Fat-tree      3.436e-2    2.420e-1    1.878       1.493e1     1.310e2

Communicate  Ring          4.582e-2    0.909e-1    0.757       3.010       7.110
             Fat-tree      4.698e-2    1.061e-1    0.359       3.020       5.140

Table 1.  CPU Time (seconds) of Ring and Fat-Tree Orderings


                           n = 64      n = 128     n = 256     n = 512     n = 1024

Overall      Ring          27.61       51.96       51.07       59.85       62.25
             Fat-tree      25.78       48.20       60.00       59.82       63.11

Compute      Ring          69.60       72.32       71.74       71.92       65.62
             Fat-tree      61.03       69.33       71.47       71.92       65.57

Table 2.  Mflops Rates of Ring and Fat-Tree Orderings
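
The rates in Table 2 follow from the timings in Table 1 and the count of 8n^3 flops per
sweep. A short Python check, using the overall timings only (the function and variable
names are ours):

```python
def mflops(n, seconds):
    """Rate in millions of flops per second, at 8 n^3 flops per sweep."""
    return 8 * n ** 3 / seconds / 1e6


# Overall times in seconds, copied from Table 1
overall = {
    "Ring":     {64: 7.595e-2, 128: 3.229e-1, 256: 2.628, 512: 1.794e1, 1024: 1.380e2},
    "Fat-tree": {64: 8.134e-2, 128: 3.481e-1, 256: 2.237, 512: 1.795e1, 1024: 1.361e2},
}

for ordering, times in overall.items():
    print(ordering, [round(mflops(n, t), 2) for n, t in times.items()])
```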

ABSTRACT


Processors  with  multiple  functional  units,  including  the  superscalars, 
achieve  significant  performance  enhancement  through  low-level  execution 
concurrency.  In  such  processors,  multiple  instructions  are  often  issued  and 
definitely  executed  concurrently  and  out-of-order.  Consequently,  interrupt 
and  exception  handling  becomes  a  vexing  problem. 

We  identify  factors  that  must  be  considered  in  evaluating  the  effectiveness  of 
interrupt  and  exception  handling  schemes:  latency,  cost,  and  performance 
degradation.  We  then  briefly  enumerate  proposals  and  implementations  for 
interrupt  and  exception  handling  on  out-of-order  execution  processors. 

Next,  we  present  an  efficient  hardware  mechanism,  the  Instruction  Window 
(IW),  and  a  new  approach,  which  allows  for  precise,  responsive  and  flexible 
interrupt  and  exception  handling. 

The  implementation  of  the  IW  is  then  discussed.  The  design  of  an  8-cell  IW 
has  been  carried  out;  it  can  work  with  a  very  short  machine  cycle  time. 

Finally,  we  present  a  comparison  of  all  interrupt  and  exception  handling 
schemes  for  out-of-order  execution  processors. 


*  The  research  reported  herein  has  been  supported  in  part  by  the  Joint 
Services  Electronics  Program.  Contract  Number  F49620-90-C-0039. 

Interrupt Handling For Out-of-Order Execution Processors


1.  Introduction 

Processors  with  multiple  functional  units  issue  and  execute  multiple 
instructions  concurrently  and  possibly  out-of-order;  they  enhance 
performance  by  extracting  low-level  concurrency  from  the  instruction  stream 
[1,  2,  3].  The  CDC  6600,  IBM  360/91,  and  the  CRAY  machines  are  forerunners 
of  this  class  of  processors;  however,  these  processors  issue  at  most  one 
instruction  per  cycle.  Due  to  advances  in  device  technologies,  recently 
announced  RISC  processors  often  issue  and  certainly  execute  multiple 
instructions concurrently. However, these processors have not been able to
support interrupt and exception¹ handling efficiently and with an acceptable
latency.

In  this  paper,  we  address  the  interrupt  handling  problem,  which  has 
hampered  the  development  of  processors  which  execute  and  may  even  issue 
multiple  instructions.  We  propose  an  efficient  hardware  mechanism,  which 
supports an interrupt handling scheme with a flexible latency, set specifically
for each type of interrupt requested.

The  remaining  sections  are  organized  as  follows:  Section  2  presents  a 
discussion  of  interrupts  and  exceptions.  Factors  for  evaluating  the 
effectiveness  of  interrupt  handling  schemes  are  presented.  Existing  proposals 
and  implementations  for  interrupt  handling  on  out-of-order  execution 
processors  are  briefly  reported  in  Section  3. 

Section  4  presents  the  Instruction  Window  (IW),  a  simple  and  yet 
versatile  hardware  mechanism  which  supports  efficient  and  flexible  interrupt 
handling.  Basic  window  operations  are  introduced  in  Section  5.  Section  6 
proposes an innovative interrupt handling scheme, which makes use of the
IW.  In  Section  7,  we  discuss  the  implementation  of  the  IW.  Section  8  gives 
an  evaluation  of  all  interrupt  handling  schemes. 


2.  Interrupts  and  Exceptions 

An  important  and  indispensable  feature  of  any  processor  is  its  ability  to 
handle  properly  interrupts  and  exceptions.  An  I/O  device,  a  sensor,  or  a  timer 
may "interrupt" a processor to perform a specific task. An executing
instruction may cause a page fault or an overflow/underflow; an "exception"
thus results. Finally, one may place an instruction in an instruction stream to
call for a "trap", which initiates a pre-planned action. Presentations on
interrupts, exceptions and traps can be found from many sources, among
them: [4, 5, 6, 7, 8]. In this paper, we use the term interrupt to denote an
interrupt, an exception or a trap. Our study does not treat the subject of
interrupt detection; rather, we investigate how a processor responds to an
interrupt request, once it has been received.

1  From now on, we will simply use interrupt to stand for interrupt and
exception.

When  an  interrupt  request  is  received,  the  processor  must  save  its 
processor  state,  then  load  and  execute  an  appropriate  interrupt  handler.  Upon 
completion  of  the  interrupt  handling  routine,  the  saved  processor  state  is 
restored,  and  the  interrupted  process  can  then  be  restarted. 

A  processor  state  should  contain  enough  and  preferably  only  enough 
information  so  that  the  interrupted  process  can  be  restarted  at  the  precise 
point  where  it  was  interrupted.  To  be  able  to  resume  an  interrupted  process, 
the  processor  state  should  consist  of  the  contents  of  the  general  purpose 
registers,  the  program  counter,  the  condition  register,  all  index  registers  and 
the  relevant  portion  of  the  main  memory. 

The  classical  approach  to  identifying  precisely  the  point  where  a  process 
is  interrupted  is  to  save,  among  other  vital  items,  the  address  of  a  specific 
instruction,  say  instruction  a,  when  the  processor  state  is  saved.  All 
instructions  that  precede  instruction  a  have  been  executed.  And  instruction  a 
and  those  that  follow  it  have  not.  Instruction  a  thus  provides  a  precise 
interrupt  point. 

For processors that execute instructions concurrently and possibly
out-of-order, identifying a precise interrupt point when an interrupt
request is made may become very costly.

In  order  to  evaluate  interrupt  handling  schemes,  a  framework  must  be 
established.  Three  factors  have  been  identified: 

1)  Latency:

An interrupt handling approach must be judged by the latency
between the receipt of an interrupt request and the completion of
saving the processor state. Clearly, any acceptable interrupt handling
scheme should yield a latency that is appropriate for the interrupt
request, which may be generated internally or externally.

2)  Component  Cost: 

The  cost  of  additional  hardware  and  software  incurred  by  the 
installation  of  an  interrupt  handling  scheme  must  be  considered. 

3)  Performance Degradation:

The  presence  and  operation  of  an  interrupt  handling  scheme  may 
bring  about  performance  degradation;  its  extent  should  be  critically 
examined. 

There  are  three  sources  of  degradations: 

i.  Abort -- In response to an interrupt request, some instructions
that have already been partially or even completely executed are
"aborted";

ii.  Execution inhibition -- The need to maintain a "consistent"
processor state prevents some instructions which have been
executed out-of-order from depositing their results; this in turn
inhibits the execution of subsequent instructions which use these
results as operands;

iii.  Update -- Certain schemes, such as checkpointing, require
run-time continuous updating operations, which have to be
performed by the processor.


3.  Interrupt Handling Schemes

In  the  past,  the  trend  in  the  design  of  processors  with  multiple 
functional  units  has  been  towards  sequential  instruction  issue,  concurrent 
execution  and  possibly  out-of-order  instruction  completion. 

The CDC 6600 [9] maintains a "SCOREBOARD" to resolve dependency
conflicts  among  instructions  in  an  instruction  stream,  and  allows  these 
instructions  to  complete  out-of-order.  The  "exchange  jump"  is  the  primary 
interrupt  mechanism  for  the  Central  Processing  Unit  (CPU)  to  handle 
interrupts.  If  the  exchange  jump  sequence  is  requested,  the  CPU  is  permitted 
to  issue  instructions  up  to,  but  not  including,  the  next  instruction  word.  All 
issued  instructions  are  allowed  to  run  to  completion.  The  CPU  registers  are 
then  interchanged  with  the  data  stored  in  the  exchange  package.  The  CPU  is 
restarted  at  the  location  specified  by  the  new  contents  of  the  program  counter. 
Since  the  processor  must,  on  average,  wait  for  two  instructions  to  be  issued 
and  completed  before  the  interrupt  can  be  serviced,  this  approach  exacts  a 
penalty  in  latency. 

The  IBM  360/91  supports  both  precise  and  imprecise  interrupt 
handling  [10].  Upon  the  receipt  of  a  precise  interrupt  request,  or  a  trap  (either 
precise  or  imprecise),  the  instruction  decoding  is  temporarily  halted  and  all 
issued  instructions  are  allowed  to  complete,  thereby  resulting  in  considerable 
latency. If an imprecise interrupt is generated (via internal processing), the
state of the system is lost and therefore the interrupted process cannot be
restarted precisely.

When an interrupt is received in the CRAY-1 [11, 12], instruction issue
is temporarily terminated, and all vector and memory bank references are
allowed to complete. The interrupt handler is loaded and executed in a
manner similar to that employed by the CDC 6600. The CRAY-1 must, on
average, wait for two instructions to complete before the processor state can be
saved. However, as the CRAY-1 supports complex vector operations, the
latency (in cycles) may be longer than that of the CDC 6600.

Hwu and Patt [13] proposed a promising approach to handling
interrupts. A minimum of two checkpoints and hence two additional states
are required. Essentially, the checkpoints, which invariably incur some
penalty in processor performance, are used to divide the sequential
instruction stream into smaller units to reduce the cost of "repair".

Smith and Pleszkun [14] presented several interesting methods to
realize  the  classical  precise  interrupt.  The  simplest  is  the  in-order  instruction 
completion  method:  an  instruction  is  only  allowed  to  modify  the  processor 
state  when  all  its  preceding  instructions  are  certain  to  be  allowed  to  complete. 
A  "reorder  buffer"  is  added  so  that  instructions  are  permitted  to  complete  out 
of  order;  it  is  used  to  reorder  them  before  they  are  permitted  to  modify  the 
processor  state.  "History  buffer"  and  "future  files"  are  suggested  as 
alternatives.  Result  bypass  is  proposed  to  reduce  concomitant  performance 
degradations,  which  they  quantified  with  extensive  simulations. 

Sohi [15] deftly combined the operations of reservation stations and
reorder  buffers  into  the  "register  update  unit"  to  effect  precise  interrupts.  In 
addition,  Smith  and  Pleszkun  [14]  presented  several  very  stimulating 
"architectural  solutions";  these  include  saving  the  "intermediate  state  of 
vector  instructions"  and  saving  "a  sequence  of  instructions  that  must  be 
executed  before  the  saved  program  counter  is  precise". 


4.  The Instruction Window (IW)

In  this  section,  we  present  a  hardware  mechanism,  which  contributes 
toward  precise,  responsive  interrupt  handling  for  processors  with  multiple 
functional  units. 

The  general  structure  of  a  processor  with  multiple  functional  units  is 
shown  in  Figure  1.  It  depicts  a  General  Purpose  Register  (GPR),  or 
equivalently Load/Store, architecture.

The instruction unit prepares the incoming instructions for execution,
and  issues  the  instructions  to  the  appropriate  functional  units  via  the 
interconnection  network.  The  functional  units  operate  on  the  given 
operands  and  produce  results  which  are  returned  to  the  appropriate  registers. 

We  propose  the  installation  of  a  hardware  structure  named  the 
Instruction  Window  (IW).  The  IW,  shown  in  Figure  2,  consists  of  a  set  of 
registers,  to  be  called  cells.  One  and  only  one  instruction  occupies  a  cell.  Such 
a  cell  serves  as  a  "staging  area"  for  an  instruction.  In  a  conventional 
processor,  an  instruction  is  removed  from  the  staging  register  as  soon  as  it  has 
been  issued  to  a  functional  unit.  In  the  proposed  mechanism,  an  instruction 
remains  in  its  "staging  register"  after  its  issuance. 

We  use  a  three-operand  format  for  instructions: 

i:  OP  S1, S2, D                                        (1)

where i denotes the instruction tag, OP the operation, S1, S2 the registers used
as sources, and D the destination register.

Each  cell  contains  at  least  three  fields:  issue,  tag,  and  instruction.  An 
optional  vector  element  number  (VEN)  can  be  added  for  those  processors 
with  vector  instructions. 

The  issue  field  has  one  bit,  which  is  used  to  indicate  whether  that 
instruction  has  been  issued  to  a  functional  unit.  This  field  will  not  be  shown 
in  later  figures. 

The  tag  field  contains  a  tag  which  uniquely  identifies  the  instruction 
held  in  that  cell. 

The  instruction  field  contains  a  copy  of  the  instruction  as  it  was  fetched 
from  the  Instruction  Buffer. 

The optional vector element number (VEN) field is set to a value
equal to the number of vector elements left to be processed. We assume the
availability  of  a  Vector  Length  Register,  which  provides  the  initial  value  for 
VEN.  If  the  instruction  is  a  scalar  instruction,  VEN  is  set  to  1. 

Thus,  as  an  example,  the  following  2-instruction  sequence  may  appear 
in  the  IW  as  shown  in  Figure  3: 

1:  ADD  R0, R1, R2

2:  ADF  VR0, VR1, VR2                                   (2)

Note  that  instructions  1  and  2  occupy  two  consecutive  cells  and  the  cell  for 
instruction  1  is  above  that  for  instruction  2.  The  issue  field  is  omitted.  The 
opcode ADD stands for an integer addition and ADF stands for a floating-point
addition. Registers are denoted with R's and vector registers with VR's.
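
To make the cell layout concrete, the fields can be modelled as a small Python record.
This is only an illustrative sketch: the class name Cell, the field names, and the vector
length of 64 are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

VECTOR_LENGTH = 64      # hypothetical content of the Vector Length Register


@dataclass
class Cell:
    tag: int                 # uniquely identifies the instruction in this cell
    instruction: str         # copy of the instruction as fetched
    issued: bool = False     # one-bit issue field
    ven: int = 1             # vector elements left; 1 for a scalar instruction


# The 2-instruction sequence (2); index 0 is the topmost cell.
window = [
    Cell(tag=1, instruction="ADD R0, R1, R2"),
    Cell(tag=2, instruction="ADF VR0, VR1, VR2", ven=VECTOR_LENGTH),
]

for cell in window:
    print(cell.tag, cell.instruction, cell.ven)
```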

5.  Window  Operations 

We  present  in  the  following  the  basic  operations  for  the  IW:  fill,  issue, 
and  remove/update. 

Fill 


When  a  fill  operation  begins,  the  IW  has  already  "pushed"  its 
remaining  instructions  to  the  top.  Let  instruction  i  precede  instruction  j  in  an 
instruction stream; then i is always located "higher" in the IW than j. And
the  empty  cells  are  found  at  the  bottom. 

When  an  instruction  is  written  into  the  IW,  it  is  always  placed  at  the 
topmost  empty  cell  with  a  unique  tag.  Concurrently,  if  the  VEN  field  is 
implemented,  it  is  set  to  N,  the  vector  length  specified  by  the  Vector  Length 
Register.  The  instructions  freshly  written  into  the  IW  follow  the  same  order 
seen  in  the  instruction  stream. 

Due  to  restrictions  imposed  by  available  data  paths,  the  number  of 
instructions  that  can  be  moved  concurrently  from  the  instruction  buffer  to 
the  IW  has  to  be  limited. 

At  this  point,  a  reader  may  justifiably  have  concerns  about  the 
implementation  of  the  fill  and  other  operations  to  be  introduced  in  this 
section  and  their  possible  impact  on  the  machine  cycle  time;  this  will  be 
addressed  in  Section  7. 

We  use  the  sequence  of  instructions,  given  in  (3),  to  illustrate  the 
operations  of  the  IW. 


1:  MUL  R0,  R1,  R0
2:  ADD  R2,  R3,  R2
3:  ADF  VR0, VR1, VR0
4:  ADD  R4,  R5,  R4
5:  ADD  R6,  R7,  R6
6:  ADD  R8,  R9,  R8
7:  ADD  R10, R11, R10                                   (3)

The  opcode  MUL  stands  for  an  integer  multiplication  operation.  The 
sequence  in  (3)  is  designed  simply  to  present  the  salient  features  of  the  IW;  it 
is  not  meant  to  stand  for  any  meaningful  computation. 

For this example, let the processor issue at most one instruction per
cycle. It is further assumed that instructions are written into the IW from the
instruction buffer at a rate of one per cycle. Three functional units are
available:  an  integer  add  unit,  a  floating-point  add  unit,  and  an  integer 
multiply  unit.  Both  the  integer  and  the  floating  point  add  operations  have  a 
latency  of  three  cycles.  Pipelining  has  been  employed  so  that  an  add  operation 
may  be  started  every  cycle.  An  integer  multiply  takes  six  cycles  and  the 
multiply  unit  is  not  pipelined. 

In  cycle  1,  instruction  1  is  written  into  the  topmost  cell  of  the  IW.  In 
cycle 2, instruction 2 is written into the IW. As long as there is room in the
IW, in cycle i (i = 3, 4, ..., 6), instruction i is written into the IW. Assuming an
IW with at least 6 cells, the contents of the IW after cycle 6 are depicted in
Figure 4.
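
The fill operation for this example can be sketched in Python as follows. The IW is
modelled as a list whose index 0 is the topmost cell. The function name fill and the
6-cell capacity are illustrative assumptions, and removals are ignored because none
occur in the first six cycles of the example.

```python
def fill(window, stream, capacity, per_cycle=1):
    """Move up to per_cycle instructions from the stream into the topmost
    empty cells of the window, preserving instruction-stream order."""
    moved = 0
    while stream and len(window) < capacity and moved < per_cycle:
        window.append(stream.pop(0))
        moved += 1
    return moved


stream = [1, 2, 3, 4, 5, 6, 7]     # tags of the instructions in sequence (3)
window = []                        # empty IW with 6 cells (assumed size)

for cycle in range(1, 7):          # cycles 1 through 6, one fill per cycle
    fill(window, stream, capacity=6)

print(window, stream)
```

After cycle 6 the window holds instructions 1 through 6 from top to bottom, and
instruction 7 is still waiting in the instruction buffer.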

Issue 


For  processors  which  issue  at  most  one  instruction  per  cycle,  at  the 
beginning  of  each  cycle,  the  topmost  instruction  that  has  not  been  issued  to  a 
functional unit is examined. If an appropriate functional unit and the
requisite data paths are available, and no data dependency exists, the
instruction is issued. For possible future extensions to processors which may
issue  up  to  k  instructions  per  cycle,  where  k  >  1,  at  least  the  topmost  k 
instructions  that  have  not  been  issued  are  examined. 

For  each  issued  instruction,  its  opcode,  operands  and  result 
specifications  are  passed  to  the  assigned  functional  unit,  and  actions  are 
initiated  to  copy  operands  from  the  source  registers  to  the  functional  unit. 

Fill  and  issue  operations,  as  specified,  will  be  performed  concurrently. 
Implementation  issues  are  discussed  in  Section  7. 

Returning  to  our  example,  instruction  1  is  issued  to  the  integer 
multiplier  in  cycle  2.  During  cycle  3,  instruction  2  is  issued  to  the  scalar  adder. 
Instruction  3  is  issued  to  the  floating-point  adder  in  cycle  4.  Note  that  each  of 
these  instructions  does  not  have  any  data  dependency.  For  each  subsequent 
cycle,  at  most  one  instruction  will  be  dispatched  to  a  functional  unit.  The 
processing  of  these  instructions  is  depicted  in  Figure  5,  where  L  stands  for  fill, 
I  for  issue,  E  for  execute,  and  S  for  deposit. 

Let  us  examine  instruction  2,  which  is  issued  in  cycle  3,  as  shown  in 
Figure  5.  Starting  in  cycle  4,  the  pipelined  integer  add  unit  operates  on  its 
operands.  Since  it  has  an  execution  latency  of  3  cycles,  it  will  produce  its  result 
at  cycle  7  and  proceed  to  deposit  it  into  the  destination  register,  R2. 

Remove/Update 

At the instant when an interrupt request is received by the processor,
an  issued  instruction  may  be  at  an  intermediate  stage  of  execution.  We  have  a 
choice  of  aborting  the  execution  of  that  instruction  or  completing  its 
execution.  Aborting  an  instruction  means  that  the  execution  already 
performed  is  wasted;  on  the  other  hand,  letting  an  instruction  execution  run 
to  its  completion  may  impose  a  long  latency  before  responding  to  an  interrupt 
request. 

We introduce a new parameter, the No Return Point (NRP): the execution
point after which the instruction should not be prevented from changing the
processor state.

For  an  interrupt  which  requires  "fast"  response,  for  example  an 
internal  "machine  check",  we  can  set  the  NRP  to  be  at  the  start  of  the  final 
machine cycle, when the computation result is written into the
destination register. With such an NRP setting, only those executing
instructions which are about to deposit their results² are allowed to complete;
all  other  instructions  are  aborted;  some  of  the  instruction  processing  already 
performed  is  traded  away  for  short  latency. 

At  the  other  extreme,  the  NRP  can  be  set  at  the  start  of  an  instruction 
execution.  In  so  doing  all  instructions  which  have  started  execution  are 
allowed  to  complete.  As  appropriate,  the  NRP  can  be  set  somewhere  between 
the  two  extreme  cases. 

The  definition  of  the  NRP  provides  us  with  a  means  of  achieving 
flexible  responses  to  various  types  of  interrupts.  Each  interrupt  type  has  its 
own  NRP  setting;  a  processor  reacts  differently  in  response  to  different  types 
of  interrupts. 

For  illustrative  purposes,  we  set,  in  this  paper,  the  NRP  at  the  start  of 
the  final  cycle,  when  the  computation  result  is  written  into  the  destination 
register.  For  instruction  2,  its  NRP  is  cycle  7. 
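
The effect of the two extreme NRP settings can be sketched with a small Python model.
Everything here is an illustrative assumption rather than the paper's mechanism: the
function names, the encoding of the setting as "final" or "start", and the rule that an
instruction issued in cycle c with latency L executes in cycles c+1 through c+L and
deposits its result in cycle c+L+1.

```python
def nrp_cycle(issue_cycle, latency, setting="final"):
    """Cycle at whose start the instruction reaches its No Return Point."""
    if setting == "final":          # start of the cycle that writes the result
        return issue_cycle + latency + 1
    if setting == "start":          # start of instruction execution
        return issue_cycle + 1
    raise ValueError(setting)


def on_interrupt(in_flight, interrupt_cycle, setting="final"):
    """Split issued instructions into those allowed to complete and those aborted."""
    completes, aborted = [], []
    for tag, (issue_cycle, latency) in in_flight.items():
        reached = nrp_cycle(issue_cycle, latency, setting) <= interrupt_cycle
        (completes if reached else aborted).append(tag)
    return completes, aborted


# Instructions 1 (MUL, issued in cycle 2, latency 6) and 2 (ADD, issued in cycle 3, latency 3)
in_flight = {1: (2, 6), 2: (3, 3)}

print(nrp_cycle(3, 3))                            # NRP of instruction 2
print(on_interrupt(in_flight, 7, "final"))
print(on_interrupt(in_flight, 7, "start"))
```

With the NRP at the final cycle, an interrupt arriving in cycle 7 lets only instruction 2
complete; with the NRP at the start of execution, both instructions run to completion.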

When  an  instruction  reaches  its  NRP,  it  will  be  allowed  to  complete 
and  therefore  should  be  removed  from  the  IW  upon  its  completion.  To 
accomplish  this,  the  executing  functional  unit  returns  the  instruction  tag  to 
the  IW  for  identification. 

It  is  quite  reasonable  to  assume  that  it  takes  one  cycle  to  transmit  a  tag 
from  the  functional  unit  to  the  IW.  The  functional  unit  returns  the  tag  of  an 
executing  instruction  to  the  IW  one  cycle  before  it  reaches  its  NRP,  so  that  it 
will  be  identified  as  soon  as  it  reaches  its  NRP. 


2  Note  that  a  result  is  always  deposited  into  the  register  file,  not  the  IW. 

More than one instruction may reach its NRP at a given cycle. To
accommodate the return of multiple tags, we propose a "1 out of w" code for
the tags. As an example, let there be 8 cells in the IW, so w = 8. Each tag has 8 bits,
of which one and only one has a value of 1. In this case, a path with 8 bits will
suffice to return all 8 tags, if necessary.

All  cells  whose  tags  match  the  returned  tags  are  marked.  Instructions 
residing  in  marked  cells  can  then  be  updated  or  removed.  An  instruction  in  a 
marked  cell  is  removed  if  its  VEN  value  is  1.  If  the  VEN  value  is  greater  than 
1,  then  it  is  decreased  by  1. 

Instructions  are  removed  so  that  all  empty  cells  are  found  at  the 
bottom  of  the  IW. 

Remove  and  update  operations  follow  the  tag  matching,  and  take  place 
concurrently. 

Since  the  start  of  cycle  7  is  the  NRP  of  instruction  2,  its  tag  is  returned 
to  the  IW  during  cycle  6.  In  cycle  7,  an  associative  search  is  performed  on  the 
IW  using  the  incoming  tag  as  a  key.  Since  the  VEN  value  for  instruction  2  is 
1,  during  the  remove/update  operations  of  cycle  7,  instruction  2  is  eliminated 
as  shown  in  Figure  6.  Note  that  the  remaining  instructions  are  pushed  to  fill 
the  top  of  the  IW,  preserving  their  order  in  the  instruction  stream.  Note  also 
that  register  R2  is  updated  with  the  result  produced  by  instruction  2. 

To  illustrate  the  IW  operations  further,  let  us  examine  the  execution  of 
instruction  3,  depicted  in  Figure  5.  The  first  element  of  the  add  instruction, 
issued  in  cycle  4,  will  pass  its  NRP  at  the  start  of  cycle  8.  Thus,  its  tag  is 
returned  to  the  IW  during  cycle  7,  and  is  associatively  matched  with  the  tags 
in  the  IW  during  tag  matching  in  cycle  8.  Then,  since  its  VEN  value  is  not 
equal  to  1,  the  vector  element  number  corresponding  to  instruction  3  is 
decreased  by  1  during  the  remove/update  operations  of  this  cycle,  as  shown  in 
Figure  7.  The  presence  of  2  in  the  VEN  field  indicates  that  two  vector 
elements  remain  to  be  processed. 
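
The tag return and remove/update steps walked through above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the names Cell, one_hot, combine and remove_update are hypothetical, the cell contents are invented for the demonstration rather than copied from Figures 4 to 7, and a scalar instruction is assumed to carry a VEN of 1.

```python
from dataclasses import dataclass, replace
from typing import List

W = 8  # number of IW cells, as in the w=8 example in the text


@dataclass(frozen=True)
class Cell:
    name: str            # readable label (hypothetical, for the demo only)
    tag: int             # "1 out of w" tag: exactly one bit set
    ven: int             # vector element number; assumed 1 for scalars
    issued: bool = True  # the "issue" bit of the cell


def one_hot(k: int) -> int:
    """Return the 1-out-of-w tag for cell index k."""
    if not 0 <= k < W:
        raise ValueError("tag index out of range")
    return 1 << k


def combine(tags: List[int]) -> int:
    """OR all returning tags into one w-bit word."""
    word = 0
    for t in tags:
        word |= t
    return word


def remove_update(iw: List[Cell], returned: int) -> List[Cell]:
    """Tag match followed by remove/update; survivors keep stream order."""
    survivors = []
    for cell in iw:
        if cell.issued and (cell.tag & returned):
            if cell.ven == 1:
                continue                          # remove the instruction
            cell = replace(cell, ven=cell.ven - 1)  # one element done
        survivors.append(cell)
    return survivors


# Assumed contents: instruction 2 is scalar, instruction 3 has 3 elements.
iw = [Cell("i1", one_hot(0), 1), Cell("i2", one_hot(1), 1),
      Cell("i3", one_hot(2), 3)]
after_cycle_7 = remove_update(iw, combine([one_hot(1)]))
after_cycle_8 = remove_update(after_cycle_7, combine([one_hot(2)]))
```

Because every tag occupies its own bit position, several tags can share the same w-bit return path in one cycle without ambiguity, which is the property the "1 out of w" code is chosen for.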

The timing of these three basic operations (fill, issue, and remove/update)
is depicted in Figure 8. The implementation details will be presented in
Section 7.

Let  an  interrupt  request  be  received  in  cycle  8.  The  processing  of  the 
received  interrupt  request  will  be  described  in  the  following  section. 


6.  Interrupt  Handling

Upon  the  receipt  of  an  interrupt  request,  the  processor  responds  as 
follows: 
1)  At  the  start  of  the  following  cycle,  an  Interrupt  Request  Signal  is
generated,  and  sent  to  the  executing  functional  units.  This  aborts 
all  instructions  that  have  not  passed  their  NRPs.  Any  instruction 
which  has  passed  its  NRP  is  allowed  to  complete  its  execution. 

2)  When  all  instructions  that  are  permitted  to  proceed  complete 
their  execution,  the  processor  state  is  saved.  The  IW  is  included  as 
a  component  of  the  processor  state  and  the  contents  of  the 
occupied  cells  of  the  IW  are  saved. 

3)  The  appropriate  interrupt  handler  is  then  fetched  from  memory 
and  executed  by  the  processor. 

In the proposed scheme, the IW is included as a component of the
processor state. The saved contents of the IW provide a modified "precise"
interrupt point. The IW does not identify one instruction which defines the
precise interrupt point; rather, it identifies a group of instructions in the IW,
which jointly define the "point" where the interrupted processing should
resume.

As  a  component  of  the  processor  state,  the  saved  contents  of  the  IW 
cannot  be  modified  by  any  interrupt  handler. 

If  an  instruction  remains  in  the  IW,  its  VEN  field  specifies  the  number 
of  elements  to  be  processed  for  the  given  instruction.  This  information  is 
used  to  restart  the  processor  at  the  completion  of  the  interrupt  handling 
procedure.  The  introduction  of  the  VEN  field  obviates  the  need  for  the 
processor  to  re-execute  an  incomplete  vector  instruction  from  the  very 
beginning  when  the  processing  resumes. 

Returning  to  our  example,  the  interrupt  request  was  received  during 
cycle  8.  The  processor  then  generates  the  Abort  Signal  during  cycle  9. 

At the start of cycle 9, the second element of the vector operand
specified by instruction 3, and instructions 1 and 4, pass their respective
NRPs. Thus the instructions bearing tags 1 and 4 are eliminated from the IW
during cycle 9, and the VEN value for instruction 3 is decreased by 1. The
resulting contents of the IW are shown in Figure 9; they are saved as part of the
processor state.

Note that registers R0, R4, and the second element of the vector
register, VR0, are all updated by the execution of instructions 1, 4 and 3
respectively.
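
The abort decision in this example can be modelled as a simple comparison of cycle numbers. The sketch below is an assumption-laden illustration, not the paper's hardware: InFlight and respond_to_interrupt are hypothetical names, the boundary (an instruction that passes its NRP at the start of the signal cycle is allowed to complete) is read off the cycle 8/9 example above, and the entry "i5" with NRP cycle 11 is invented to show an aborted instruction.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InFlight:
    name: str
    nrp_cycle: int  # cycle at whose start the instruction passes its NRP


def respond_to_interrupt(in_flight: List[InFlight],
                         request_cycle: int) -> Tuple[List[str], List[str]]:
    """Split executing instructions into (allowed to complete, aborted)."""
    signal_cycle = request_cycle + 1  # signal is raised in the next cycle
    complete = [i.name for i in in_flight if i.nrp_cycle <= signal_cycle]
    aborted = [i.name for i in in_flight if i.nrp_cycle > signal_cycle]
    return complete, aborted


# Request received in cycle 8; "i5" is a hypothetical late instruction.
executing = [InFlight("i1", 9), InFlight("i3[2]", 9),
             InFlight("i4", 9), InFlight("i5", 11)]
completed, flushed = respond_to_interrupt(executing, request_cycle=8)
```

Moving an instruction's NRP earlier shifts it from the aborted list to the completed list, which is how the NRP setting trades interrupt latency against wasted work.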

Furthermore, in cycle 9, all functional units are flushed, thereby
eliminating all instructions which have not passed their NRPs. The
maximum number of execution cycles that could be wasted on an instruction
by flushing the functional units is the latency of the "slowest" functional
unit. This is the greatest number of cycles that may have elapsed between the
issuance and one cycle before the instruction passes its NRP.

The  processor  state  is  saved  during  cycle  10.  Note  that  instructions  3,  5, 
6  and  7  define  jointly  the  "precise"  interrupt  point.  Only  the  last  element  of 
the  vector  add  remains  to  be  processed  when  the  execution  of  the  interrupted 
process  resumes.  Note  that  instruction  4,  which  follows  instruction  3,  does 
not  appear  as  it  has  already  completed  its  execution  before  the  interrupt 
request  arrived. 

The  interrupt  handler  is  then  fetched  and  executed.  As  a  specific 
example,  suppose  one  of  the  instructions  being  executed  causes  an  exception 
due  to  overflow;  the  interrupt  handler  can  execute  the  program,  starting  with 
the  instructions  remaining  in  the  IW  -  the  modified  "precise"  interrupt  point, 
one  instruction  at  a  time  to  identify  the  source  of  the  problem. 

As  in  any  processor,  after  completion  of  the  interrupt  handler,  the 
original  processor  state,  of  which  the  IW  is  a  component,  is  restored. 
Instruction  issuing  is  then  restarted  from  the  top  of  the  IW. 

Returning to the example, the IW will be restored as shown in Figure 9
at the completion of the interrupt handler. Let the processor state be restored
at cycle X. Thus, instruction 3 is issued again with the VEN field being 1 in
cycle X. In cycle X+1, instruction 5 is issued to the integer adder. In cycle X+2,
instruction 6 is issued. Finally, instruction issuance completes in cycle X+3,
when instruction 7 is issued to the adder.


7.  Implementation 

The  Instruction  Window  (IW)  plays  an  important  part  in  the  proposed 
interrupt  handling  scheme;  it  also  serves  as  the  staging  registers  for 
instruction  decoding.  The  IW  differs  significantly  from  a  "reorder  buffer"  [14] 
in  that  the  computation  results  are  not  deposited  into  the  IW  and  more 
importantly  instructions  can  be  removed  from  it  "out-of-order”.  We  now 
examine  the  implementation  issues  in  more  detail  and  assess  its  potential 
impact  on  machine  cycle  time. 

We  propose  that,  as  depicted  in  Figure  8,  in  each  machine  cycle  "fill", 
"issue"  and  "tag  match"  take  place  concurrently.  And  remove/update  follows 
tag  match. 

At  the  beginning  of  each  machine  cycle,  the  remaining  instructions  in 
the  IW  have  already  been  moved  to  occupy  the  top  cells.  And  the  incoming 
instructions  are  placed  into  the  cells  adjacent  to  the  occupied  ones. 
The identification of instructions that can be issued deserves more
scrutiny. The first task is to identify instructions, at the top of the IW, that
have not yet been issued. Since every cell carries an "issue" bit, a simple
combinational circuit can be used to accomplish this.

For  conventional  processors,  where  at  most  one  instruction  is  issued 
per  machine  cycle  and  instructions  are  issued  according  to  their  order  in  the 
instruction  stream,  only  the  top  unissued  instruction  in  the  IW  needs  to  be 
examined  for  possible  issuance.  This  is  exactly  what  is  done  anyway;  no 
additional  control  complexity  is  introduced.  For  future  processors  which  may 
issue  multiple  instructions,  more  complex  circuits  to  detect  and  resolve 
dependencies  have  to  be  installed.  For  an  8-cell  IW  and  a  16-element  register 
set,  we  find  that  the  "multiple  issue"  critical  path  incurs  a  16-gate  delay.  The 
actual  path  length  is  of  course  determined  by  the  packaging  details. 

We now discuss the tag match, followed by remove/update. With the
"1 out of w" coding scheme, all returning tags from the executing functional
units are "or-ed" into one. This one tag is matched with all the cells whose
instructions have been issued. Matched cells are marked for removal or
update in the second half of the cycle.

A marked cell is removed if its VEN field contains 1; otherwise, the
value in the VEN field is decreased by 1. Since we require that all remaining
instructions be pushed to the top portion of the IW, combinational circuits
are provided to write into each cell every cycle. For an 8-cell IW, we find that
the path is 12 gates long. Keep in mind that the remove/update
operation follows an associative memory search for tag matching; the total
length matches very nicely with the "issue" critical path.

To summarize: the implementation of these operations for a
moderately sized IW produces a critical path which can work with a very
short machine cycle time. We have not yet studied rigorously the size of the
IW, which can be much larger than 8 for future processors.

8.  Evaluation

Now  we  evaluate  the  scheme  with  IW  and  other  schemes  reported  in 
Section  3,  using  the  criteria  enumerated  in  Section  2.  Table  1  provides  a 
summary.  The  "Abort",  "Execution  inhibition"  and  "Updating"  performance 
degradations  are  defined  in  Section  2. 
Table 1: Evaluation of interrupt handling schemes

| Scheme | Latency | Component Cost | Performance Degradation |
|---|---|---|---|
| CDC6600 [9] | On the average, two instructions are to be issued and executed. | Provisions for exchange jump. | None. |
| 360/91 [10] | All issued instructions are allowed to complete. | None | None |
| CRAY-1 [11,12] | All vector and memory bank reference inst's are allowed to complete. | None | None |
| HPS [13] | Needs to return to the nearest consistent state. | Registers, memory and data paths needed to implement checkpoints. | Abort and update degradations incurred. |
| In-order Inst. Completion [14] | Relatively short. | Needs a "Result-shift" register file. | Abort, execution inhibition and update degradations incurred. |
| Reorder (History, Future File) Buffer [14,15] | Relatively fast. | Needs buffers and data paths. | Abort, execution inhibition and update degradations incurred. |
| Reorder Buffer with bypass [14,15] | Relatively fast. | Needs buffers and elaborate data paths. | Abort and update degradations incurred. |
| Instruction Window (IW) | Flexible with adaptive NRP settings. | Needs to implement IW. | No update and execution inhibition degradations. Abort penalty is a function of NRP. |

Figure 1: General Purpose Register, Load/Store, Processor Structure

Figure 2: The Instruction Window (IW)

Figure 4: The IW after Cycle 6

Figure 6: The IW after Cycle 7

Abstract

The  achievement  of  fast,  precise  interrupts 
and  the  implementation  of  multiple  levels  of 
branch  predictions  are  two  of  the  problems 
associated  with  the  dynamic  scheduling  of 
instructions  for  superscalar  processors.  Their 
solution  is  especially  difficult  if  short  cycle  time 
operation  is  desired.  We  present  solutions  to 
these  problems  through  the  development  of  the 
Fast  Dispatch  Stack  (FDS)  system. 

We show that the FDS is capable of
scheduling storage, branch, and register-to-register
instructions for concurrent and out-of-order
executions; the FDS implements fast and
precise interrupts in a natural, efficient way; and
it facilitates speculative execution: instructions
preceding and following one or more predicted
conditional branch instructions may issue. When
necessary, their effects are undone in one
machine cycle.

We  evaluated  the  FDS  system  with  extensive 
simulations. 


1.  Introduction 

Superscalars  exploit  instruction  level 
parallelism  by  issuing  multiple  instructions  each 
cycle  to  functional  units  when  dependencies 
allow.  Instruction  scheduling  can  be  performed 
during  compilation  (static  scheduling)  or  during 
execution  (dynamic  scheduling),  or  both. 
Dynamic  scheduling  detects  instruction 
dependencies  in  a  segment  of  the  dynamic 
instruction  stream.  The  most  general  form  of 
dynamic  scheduling,  the  issue  and  execution  of 
multiple  out-of-order  instructions,  can
significantly  enhance  system  performance 
[1][4][9].  However,  there  are  problems  with  this 
scheme  that  undermine  its  usefulness. 

The  achievement  of  precise  interrupts  is 
difficult,  particularly  if  a  fast  response  time  is 
desired.  Interrupts  are  precise  if  processor  state 
visible  to  the  operating  system  and  application 
can  be  reconstructed  to  the  state  a  processor 
would  have,  had  all  instructions  executed  in 
sequence  up  to  the  point  of  an  interrupt;  this  is 
costly  to  implement,  particularly  if  out-of-order 
stores  to  memory  may  occur. 

Branches, about 15% to 30% of executed
instructions for many applications [6], decrease
the effectiveness of multiple issues to functional
units if instructions following an undecided
branch cannot be issued. Speculative execution
on a predicted path of instructions improves
performance when the gains on correct paths
outweigh the losses from nullifying execution
effects on incorrect paths (squashing).

It  is  imperative  that  cycle  time  be  considered 
when  investigating  new  processor  structures. 
Since  a  processor’s  performance  depends  on 
throughput  (instructions  issued  per  cycle)  and 
cycle  time,  if  throughput  is  increased  at  the 
expense  of  cycle  time,  a  net  performance 
improvement  may  not  occur;  performance  may,  in 
fact,  decrease  during  the  execution  of  inherently 
sequential  code.  Hence,  we  have  developed  FDS 
structures  that  operate  on  a  short  cycle  time. 
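
The tradeoff described above can be made concrete with a small calculation. The sketch below treats performance as instructions completed per nanosecond, that is, throughput divided by cycle time. The function name instruction_rate and all of the numbers are hypothetical and chosen only to illustrate the argument; they are not measurements from this paper.

```python
def instruction_rate(ipc: float, cycle_time_ns: float) -> float:
    """Instructions per nanosecond = (instructions per cycle) / cycle time."""
    if cycle_time_ns <= 0:
        raise ValueError("cycle time must be positive")
    return ipc / cycle_time_ns


# Hypothetical designs: a simple single-issue machine with a short cycle,
# and a multiple-issue machine whose issue logic lengthens the cycle.
simple = instruction_rate(ipc=1.0, cycle_time_ns=10.0)
wide = instruction_rate(ipc=1.4, cycle_time_ns=16.0)

# On inherently sequential code the wide machine also issues about one
# instruction per cycle, so it only pays the longer cycle time.
wide_sequential = instruction_rate(ipc=1.0, cycle_time_ns=16.0)
```

With these assumed figures the wider machine is slower overall despite its higher throughput per cycle, and slower still on sequential code.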

The remainder of this paper is organized into 5 sections. A
dynamic scheduling mechanism that may issue
multiple, out-of-order instructions each machine
cycle is presented in Section 2. A fast, precise
interrupt handling capability is derived in
Section 3 and an instruction squashing capability
is presented in Section 4. The performance of the
proposed structure is evaluated in Section 5 and
tradeoffs are analyzed with simulation. In
Section 6, we present conclusions.


2.  The Fast Dispatch Stack System

We present an overview of the structure and
operation of the Fast Dispatch Stack (FDS)
system (Figure 1) in this section. A detailed and
comprehensive presentation of the FDS system,
a major enhancement of the Dispatch Stack [1],
can be found in [2]. The FDS contains a Buffer
Unit (BU) and an Issue Unit (IU). The BU
supplies instructions to the IU in a form that
facilitates fast dependency detection. The IU
detects instruction register dependencies and
issues instructions with no dependencies to the
functional units (FUs) each cycle via an
interconnection network.


Figure 1: Fast Dispatch Stack system.


The FUs indicate instruction completion by
returning tags that are issued with instructions.
The FUs read operands from and return results
to the Register File. Data Units receive storage
instructions from the IU, generate their effective
addresses, insert them into the Address Stack
where address dependencies are detected, and
perform dependency-free memory accesses. A
Load/Store instruction set architecture is
assumed.


2.1  The Buffer Unit

The BU fetches multiple instructions per
instruction cache access (a fetch block) and
generates four vectors for each instruction: a tag,
a read-vector, a write-vector and a type-vector.
Read-vector_i and Write-vector_i are generated from
instruction q_i and specify the registers that q_i
reads and writes respectively in vectors of binary
elements, one element for each register. An
element in position j is 1 if register j is accessed,
and 0 otherwise. A type-vector specifies an
instruction's type in a linear array of elements,
one position for each type.


ADD 3, 5, 6

The sum of the contents of registers
3 and 5 is written into register 6.

[Figure 2 fields: Tag_i, Type-Vector_i, Read-Vector_i, Write-Vector_i]


Figure  2:  An  example  I-Group. 


A tag is a vector of binary elements with a
length equal to the number of available tags. Each
tag is unique with one element in each tag set to
1 and the remainder to 0. Instruction q_i's tag is
designated Tag_i.
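
As a concrete illustration of the four vectors, the following sketch builds them for the ADD 3, 5, 6 example. It is a minimal model under assumptions that are not taken from the paper: a 16-register file, 8 available tags, and a made-up list of instruction types. The names IGroup, bit_vector and make_igroup are hypothetical.

```python
from dataclasses import dataclass
from typing import List

NUM_REGS = 16   # assumed register count
NUM_TAGS = 8    # assumed number of available tags
TYPES = ("alu", "load", "store", "branch")  # hypothetical type list


def bit_vector(positions: List[int], length: int) -> List[int]:
    """Vector of binary elements with a 1 at each listed position."""
    vec = [0] * length
    for p in positions:
        vec[p] = 1
    return vec


@dataclass
class IGroup:
    instruction: str
    tag: List[int]
    read_vector: List[int]
    write_vector: List[int]
    type_vector: List[int]


def make_igroup(instruction: str, tag_index: int, reads: List[int],
                writes: List[int], type_name: str) -> IGroup:
    return IGroup(
        instruction=instruction,
        tag=bit_vector([tag_index], NUM_TAGS),       # exactly one 1
        read_vector=bit_vector(reads, NUM_REGS),     # registers read
        write_vector=bit_vector(writes, NUM_REGS),   # register written
        type_vector=bit_vector([TYPES.index(type_name)], len(TYPES)),
    )


# ADD 3, 5, 6: reads registers 3 and 5, writes register 6.
# The tag index 2 is an arbitrary, representative assignment.
add = make_igroup("ADD 3, 5, 6", tag_index=2,
                  reads=[3, 5], writes=[6], type_name="alu")
```

Storing register usage as position-coded bits means a dependency check between two instructions reduces to a bitwise AND of their vectors, which is what makes the detection fast.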

An instruction together with its vectors
constitutes an I-Group. I-Group_i is derived from q_i.
Figure 2 shows an ADD instruction's I-Group.
Tag_i and Type-vector_i have representative
assignments. I-Groups are either transferred
directly to the IU, or are temporarily buffered in
the BU to be forwarded later. The BU also
transfers an instruction's tag and type-vector to
the Address Stack.

A  limited  form  of  register  renaming  is 
performed  in  the  BU  [2].  An  architected  register 
in  an  instruction  is  given  the  name  of  one  of  two 
physical  registers  that  are  reserved  for  its 


exclusive  use.  A  write  to  an  architected  register 
causes  it  to  be  given  a  name  different  from  its 
previous  one.  This  scheme  is  used  in  Section  5. 

2.2  The  Issue  Unit

The  IU  is  composed  of  the  Stack  and  the 
Dispatcher  (Figure  3).  The  Stack  stores  I-Groups 
received  from  the  BU  in  individual  buffers  (slots), 
detects  register  dependencies  between 
instructions,  repositions  I-Groups,  filling  empty 
slots,  and  removes  completed  I-Groups.  The 
Stack  determines  which  slots  contain  register 
independent  instructions  each  cycle. 

The  critical  path  length  in  an  IU  with  8  slots 
is  14  to  18  gate-delays,  depending  on 
implementation  details  [2].  A  maximum  gate  fan- 
in  of  9  is  assumed.  If  this  fan-in  cannot  be 
achieved,  the  critical  path  length  must  be 
increased  by  a  small  amount. 


Figure  3:  A  block  diagram  of  the  Issue  Unit. 


The Dispatcher prioritizes register
independent instructions each cycle based on
their precedence and transfers a subset of them
to output ports. A port is an entry point into the
interconnection network for one instruction and
its tag. The network may consist of buses, one
attached to each port, with one or more FUs on
each bus.

A  stack  of  size  n  is  an  array  of  n  slots  with 
Slot0  at  the  top.  A  slot  contains  conflict  detection 
logic,  tag  comparison  logic,  registers  to  hold  an 
I-Group,  and  logic  for  transferring  an  I-Group  into 
the  slot.  The  Stack  may  therefore  hold  n 
instructions  in  n  I-Groups.  I-Groups  occupy 
positions  in  the  Stack  based  on  precedence,  with 
the  instruction  of  the  highest  precedence  in  Slot0. 
Therefore,  an  instruction’s  register  usages  need 
only  be  compared  with  those  of  instructions  in 
higher  slots.  Independent  instructions  are  issued 
from  the  IU,  contiguous,  completed  instructions 
are  removed  from  the  Stack,  remaining  I-Groups 
are  moved  upward,  and  new  I-Groups  are 
transferred  into  the  Stack  at  the  end  of  each  cycle. 
A  detailed  description  of  the  logic  that  performs 
these  functions  is  found  in  [2]. 
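
The observation that an instruction need only be compared with instructions in higher slots can be sketched as a single pass from Slot0 downward. This is an illustrative approximation, not the logic of [2]: it assumes register usage is held as integer bit masks, models only uncompleted instructions, and conservatively treats read-after-write, write-after-read and write-after-write overlaps as conflicts. The names regs and independent_slots are hypothetical.

```python
from typing import List, Tuple


def regs(*indices: int) -> int:
    """Bit mask with one bit per listed register number."""
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask


def independent_slots(stack: List[Tuple[int, int]]) -> List[bool]:
    """stack[k] = (read_mask, write_mask) for the instruction in Slot k,
    Slot0 first. Returns, per slot, whether the instruction is free of
    register conflicts with every instruction in a higher slot."""
    result = []
    reads_above = 0
    writes_above = 0
    for read_mask, write_mask in stack:
        raw = read_mask & writes_above   # reads a register written above
        war = write_mask & reads_above   # writes a register read above
        waw = write_mask & writes_above  # writes a register written above
        result.append(not (raw or war or waw))
        reads_above |= read_mask
        writes_above |= write_mask
    return result


stack = [
    (regs(3, 5), regs(6)),   # ADD 3, 5, 6
    (regs(6, 1), regs(7)),   # reads register 6: depends on Slot0
    (regs(1, 2), regs(4)),   # touches none of the registers used above
]
flags = independent_slots(stack)
```

Accumulating the masks of higher slots as the pass descends mirrors the precedence ordering of the Stack: Slot0 is always independent, and each lower slot sees the union of everything above it.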


Figure  4:  Top  compression  and  total  compression. 


Stack compression logic selects I-Groups for
removal and transfer. We have developed two
selection methods: Total Compression, which
removes completed I-Groups from all slots, and Top
Compression, which removes only a contiguous
sequence of completed I-Groups from the top of
the Stack (Figure 4).
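
The difference between the two selection methods can be shown on a small stack. The sketch below is a software illustration only; the function names and the representation of a slot as a (name, completed) pair are assumptions made for the example.

```python
from typing import List, Tuple

Slot = Tuple[str, bool]  # (instruction name, completed?)


def top_compression(stack: List[Slot]) -> Tuple[List[Slot], List[Slot]]:
    """Remove only the contiguous run of completed I-Groups at the top.
    Returns (removed, remaining); remaining keeps its order."""
    k = 0
    while k < len(stack) and stack[k][1]:
        k += 1
    return stack[:k], stack[k:]


def total_compression(stack: List[Slot]) -> Tuple[List[Slot], List[Slot]]:
    """Remove completed I-Groups from all slots."""
    removed = [s for s in stack if s[1]]
    remaining = [s for s in stack if not s[1]]
    return removed, remaining


# Slot0 first; q3 has completed out of order, ahead of q2.
stack = [("q0", True), ("q1", True), ("q2", False), ("q3", True)]
top_removed, top_left = top_compression(stack)
all_removed, all_left = total_compression(stack)
```

Total compression frees more slots per cycle, while top compression removes instructions strictly in instruction stream order, a property that Section 3 relies on for precise interrupts.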


Figure  5:  Illustrative  pipeline  timing  of  two 
ADD  instructions  in  a  FDS  system  with  no 
resource  conflicts. 
The pipeline timing of two ADD instructions,
q_i and q_{i+1}, in a FDS system with no resource
conflicts is shown in Figure 5. The instructions
are fetched together; however, q_{i+1} is executed
after q_i because it has a data dependency on it.

2.3  The  Address  Stack 

The Address Stack is a linear array of n slots
with A-Slot0 at the top. There is a one-to-one
positional correspondence between IU stack slots
and Address Stack slots. Information on a given
instruction is held in identical slot positions in
the IU and the Address Stack. This
correspondence is maintained with simultaneous
compression operations in both stacks. A-Slot_j
contains the tag and type, and, if a storage
instruction, an effective address when generated,
address conflict, and memory access status
information on the instruction in IU Slot_j. When
the Buffer Unit transfers I-Group_k to IU Slot_j, a
copy of Tag_k and Type-Vector_k is transferred to
A-Slot_j.

A data unit generates and inserts the
effective address of a storage instruction, q_s, into
the Address Stack slot containing a copy of its
tag, Tag_s. The effective address of q_s is
compared with that of
preceding storage instructions in the Address
Stack, i.e., with those in higher slots. The tag of
an address conflict free storage instruction is
asserted and maintained on the Conflict-Free
Bus until it completes. This bus, similar to the
Tag Bus, simultaneously accommodates multiple
tags. It is monitored by one or more data units
for address conflict information on multiple
storage instructions.
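
The comparison against higher slots and the resulting Conflict-Free Bus word can be sketched as follows. Several points here are assumptions added for the illustration and are not stated in the text: an equal effective address is the only form of conflict considered, loads and stores are not distinguished, and a preceding storage instruction whose address has not yet been generated is conservatively treated as a potential conflict. ASlot and conflict_free_bus are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ASlot:
    tag: int                       # one-hot tag copied from the IU slot
    is_storage: bool               # taken from the type-vector
    address: Optional[int] = None  # effective address once generated


def conflict_free_bus(stack: List[ASlot]) -> int:
    """OR together the tags of storage instructions whose effective
    address conflicts with no preceding storage instruction.
    stack[0] is A-Slot0 (highest precedence)."""
    bus = 0
    for j, slot in enumerate(stack):
        if not slot.is_storage or slot.address is None:
            continue
        conflict = any(
            prev.is_storage
            and (prev.address is None or prev.address == slot.address)
            for prev in stack[:j]
        )
        if not conflict:
            bus |= slot.tag
    return bus


stack = [
    ASlot(tag=0b000001, is_storage=True, address=0x100),
    ASlot(tag=0b000010, is_storage=False),
    ASlot(tag=0b000100, is_storage=True, address=0x200),
    ASlot(tag=0b001000, is_storage=True, address=0x100),  # same as slot 0
    ASlot(tag=0b010000, is_storage=True),                 # address pending
    ASlot(tag=0b100000, is_storage=True, address=0x300),  # behind pending
]
bus = conflict_free_bus(stack)
```

As with the Tag Bus, one-hot tags let the tags of several conflict-free instructions be asserted on the bus at the same time.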

2.4  The  Data  Unit 

A  data  unit  generates  the  effective  addresses 
of  storage  instructions,  accesses  the  register  file, 
and  performs  memory  accesses  requested  by  the 
IU  and  approved  by  the  Address  Stack.  It  may 
temporarily  buffer  data  transferred  between  the 
cache  and  the  register  file  to  release 
dependencies  of  following  instructions  on  storage 
instructions  and  to  prefetch  data.  A  FDS  system 
contains  one  or  more  data  units.  Each  may 
submit  at  most  one  memory  access  request  via  a 
dedicated  connection  (port)  to  the  Cache  each 
cycle.  A  request  includes  a  storage  instruction’s
tag.  The  cache  returns  the  tags  of  completed
accesses  on  the  Memory  Tag  Bus  which  is 
monitored  by  the  Address  Stack  and  data  units 
for  access  completions. 

2.5  Storage  Instruction  Execution 

The IU issues a storage instruction, q_s, to a
data unit and then provides it with register
conflict information on q_s. Based on this and
address conflict information from the Address
Stack, the data unit executes q_s in phases,
informing the IU when to release dependencies on
q_s by deleting register use representations from
its I-Group.

A register used in the generation of the
effective address for q_s is an address register of
q_s. A register whose contents are fetched or stored
by q_s is a data register of q_s.


Figure  6:  Illustrative  load  and  store  instruction 
timing  assuming  a  1-cycle  data  cache  access  and 
no  register  or  address  conflicts. 


Execution proceeds in two phases that are
entered in sequence: Phase A and Phase B (see
Figure 6). Phase A is initiated by the issuance of
q_s to an available data unit when its address
registers have no conflicts. The data unit
generates and inserts q_s's effective address into
the Address Stack slot containing a copy of its tag,
Tag_s. Phase B is initiated by the IU, when q_s's
data register has no conflicts and Phase A has
begun, by placing q_s's tag on the Tag Bus during
a specified part of a machine cycle. When a tag in
a data unit matches one on the Tag Bus, the unit

Figure 7: A FDS system with a precise interrupt
capability.


3.  Precise Interrupt Handling

A precise interrupt may be caused by an
instruction, q_j, that is executing out-of-order, i.e.,
not all instructions preceding q_j have completed.
To achieve a processor state that reflects that of
a conventional machine that executed
instructions up to q_j, instructions that precede q_j
must complete execution. If an instruction q_i,
which precedes instruction q_j, causes an
interrupt while the conventional interrupt point
for q_j is being achieved, the saved state is that of
a conventional machine that executed
instructions up to q_i.

Recall that top compression removes a
contiguous sequence of I-Groups whose
instructions are complete from the top of the
Stack each cycle. Instructions are removed from
the IU in the order they entered, i.e., in
instruction stream order. This fact is central to
the scheme presented.

Figure 7 depicts a FDS system with a precise
interrupt capability. FUs and data units read
operands from and write results to a set of
[...] assume states that are consistent with sequential
instruction execution.

Recall that the write-vector in an instruction's
I-Group specifies its destination register. The
contents of W-Reg_i are transferred to A-Reg_i if
W-Reg_i is a destination register in the write-vector
of an I-Group that is compressed out.


3.1  States  Assumed  by  the  Architected 
Registers 

The A-Regs change state only after
compression operations in the IU. The A-Regs are
in the state they would have in a conventional
machine after it executed the last instruction in
the most recently compressed out group of
I-Groups. A compression-block is a group of
I-Groups containing instructions that are
concurrently compressed out and is identified by
the last instruction in the group. Let instruction
q_a be the last instruction in compression-block_a.
Let s_a be the state of the A-Regs in a conventional
machine after executing q_a.

Figure 8 illustrates how the A-Regs in a FDS
system change states in two situations: once with
no interrupts (Figure 8(a)), and once with an
interrupt (Figure 8(b)). When no interrupts occur,
the A-Regs in a FDS system experience three
states in turn, one after each of the compression
blocks shown. The instruction stream is again processed
in Figure 8(b), but this time q_{i+5} causes an
interrupt. Since q_{i+5} does not complete, it is not
compressed out, while the preceding instructions are.
After instruction q_{i+4} is compressed out, the
A-Regs constitute part of the process state of a
conventional machine that completed q_{i+4}, namely s_{i+4}.

As discussed above, one or more I-Groups
may be removed from the IU concurrently. The
write-vectors of these I-Groups are placed on the
Register Bus. Write-vector elements on the bus
control the transfer of data from the W-Regs to
the A-Regs via Register Transfer Logic. The
assertion of a write-vector element specifying
destination register R_i on the Register Bus
transfers a datum in W-Reg_i to A-Reg_i. Since each
write-vector contains at most one True (1)
element, the destination registers of multiple
I-Groups may be specified on the Bus causing
concurrent transfers.
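
The effect of the Register Bus on the two register sets can be sketched as below. This is a behavioural illustration with assumed details: write-vectors are held as integer bit masks, the register sets are Python lists, and compress_out is a hypothetical name, not a unit described in the paper.

```python
from typing import List


def compress_out(a_regs: List[int], w_regs: List[int],
                 write_vectors: List[int]) -> int:
    """Place the write-vectors of all compressed-out I-Groups on the
    Register Bus (bitwise OR) and copy W-Reg_i to A-Reg_i for every
    asserted element i. Updates a_regs in place; returns the bus word."""
    bus = 0
    for wv in write_vectors:
        bus |= wv
    for i in range(len(a_regs)):
        if bus & (1 << i):
            a_regs[i] = w_regs[i]
    return bus


# Four registers for brevity. Two I-Groups leave the IU together:
# one wrote register 1, the other wrote register 3.
w_regs = [10, 11, 12, 13]
a_regs = [0, 0, 0, 0]
bus = compress_out(a_regs, w_regs, [1 << 1, 1 << 3])
```

Register 2 also holds a new value in the W-Regs here, but it is not copied because its instruction has not been compressed out; this is how the A-Regs stay consistent with sequential execution.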

The FDS must determine when to restore the
W-Regs to a conventional machine state. An
Interrupt Bus (Figure 7) connects IU Slot0 with
system units that can detect an interrupt-causing
condition. These units may include the FUs, the
cache, and units that detect external interrupt-causing
conditions (e.g., I/O or sensor interrupts).

An interrupt is associated with an
instruction, q_i, if q_i caused a condition that must
be identified with it. The unit detecting the
condition asserts q_i's tag on the Interrupt Bus.
Since q_i's tag is not asserted on the Tag Bus, it
does not complete and is not compressed out of
the IU. Logic in IU Slot0 detects the presence of
an instruction associated with an interrupt by
comparing an instruction's tag with those on the
Interrupt Bus each machine cycle.

Let qi+5 be associated with an interrupt
detected in a FU. Its tag is placed on the
Interrupt Bus by the FU. Instruction qi+5 will
occupy Slot0 after preceding instructions have
completed. The instructions preceding qi+5 may
complete concurrently and out-of-order. The time
these instructions take to complete is not lost
because they are not re-executed when processing
resumes. The match of qi+5's tag in Slot0 with
that on the Interrupt Bus causes the A-Regs to
be transferred to the W-Regs, placing them in
the state they would have in a conventional
machine that executed instructions preceding
qi+5. If more than one tag is asserted on the
Interrupt Bus, the interrupt taken is the one
associated with the instruction of the highest
precedence. It will reach Slot0 before other
interrupt-causing instructions.
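
The interplay of top compression and the Slot0 tag match can be illustrated with a small software model. It is a sketch under assumptions: the dictionary layout of an IU entry, the function name `step`, and the register contents are hypothetical, and one call stands for one machine cycle.

```python
def step(iu, interrupt_bus, w_regs, a_regs):
    """One cycle of top compression followed by the Slot0 interrupt check.

    iu: list of entries, index 0 is Slot0; each entry has a one-hot 'tag',
    a 'done' flag and a 'dest' register index (or None).
    Returns the tag of the interrupt taken, or None.
    """
    # Top compression: retire the contiguous run of completed instructions.
    while iu and iu[0]["done"]:
        inst = iu.pop(0)
        if inst["dest"] is not None:
            a_regs[inst["dest"]] = w_regs[inst["dest"]]
    # Slot0 compares its tag with the tags asserted on the Interrupt Bus.
    if iu and iu[0]["tag"] & interrupt_bus:
        w_regs[:] = a_regs              # restore the conventional machine state
        return iu[0]["tag"]
    return None


# Hypothetical IU: q0 completed, q1 caused an interrupt, q2 completed early.
iu = [
    {"tag": 0b001, "done": True,  "dest": 0},
    {"tag": 0b010, "done": False, "dest": 1},
    {"tag": 0b100, "done": True,  "dest": 2},
]
a_regs = [1, 2, 3]
w_regs = [10, 2, 30]    # q0 and q2 have already written their W-Regs
taken = step(iu, 0b010, w_regs, a_regs)
```

In this run the result of q0 reaches the A-Regs, while the out-of-order result of q2 in the W-Regs is discarded when the A-Regs are copied back.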

An  interrupt  may  be  caused  by  a  condition 
external  to  the  processor  (e.g.,  an  I/O  or  sensor 
interrupt).  In  this  case,  further  instruction 
processing  is  not  necessary  to  achieve  a 
conventional  interrupt  point.  For  fast  operation, 
the  interrupt  point  saved  is  the  one  associated 
with  the  instruction  in  Slot0.  The  unit  detecting 
the interrupt condition asserts all tags on the
Interrupt Bus concurrently by placing all 1s on it.
Slot0 detects a tag match and causes a transfer of
A-Regs to W-Regs.

A  problem  is  caused  by  an  instruction  that 
overwrites  a  value  in  a  W-Reg  before  it  is 
transferred  to  an  A-Reg.  To  prevent  this  hazard, 
an  instruction  that  writes  to  the  destination 
register  of  a  preceding,  completed  instruction  in 
the  IU  is  not  issued. 


3.2  Precise  Interrupts  and  Memory 

We  outline  a  scheme  that  causes  main 
memory  to  experience  states  consistent  with 
sequential  instruction  execution  while  multiple 
out-of-order  load  and  store  instructions  are 
executed.  A  comprehensive  treatment  is  found  in 
[2]. If an instruction, qi, causes an interrupt,
memory is left in a state as it would be in a
conventional machine that has executed all
instructions preceding, but not including, qi.

A  copy-back  cache  is  used;  a  datum  that  is 
stored  into  a  copy-back  cache  may  be  transferred 
to  main  memory  at  a  later  time.  We  replace  each 
cache  line  with  a  cache  line  couple  composed  of 
two  cache  lines.  A  line  of  data  in  main  memory, 
previously  mapped  to  a  cache  line,  is  mapped  to  a 
cache  line  couple.  Items  in  one  line  of  a  cache  line 
couple  have  a  one-to-one  positional  correspondence 
with  those  in  the  other  line.  Two  corresponding 
items  form  a  data  couple.  Items  in  a  data  couple 
share  the  same  address  and  are  given  a  status  of 
Current  or  Pending.  At  any  time,  one  datum  is 
current  and  one  is  pending. 

A  new  cache  line  that  is  fetched  from  main 
memory  is  copied  into  both  lines  of  a  cache  line 
couple.  Items  are  marked  Current  in  one  line  and 
Pending  in  the  other.  A  store  instruction 
overwrites  the  pending  datum  of  a  data  couple. 
When  a  store  instruction  compresses  out  of  the 
IU,  the  datum  it  stored  (marked  Pending)  is 
marked  Current  and  the  other  datum  of  the  data 
couple  (marked  Current)  is  marked  Pending. 
Recall that the Address Stack prevents a store to
the effective address of a preceding completed
store instruction in the IU. Therefore, a pending
datum cannot be overwritten before it becomes
current. Data are marked Current as store
instructions  are  compressed  out  of  the  IU  in 
instruction  stream  order,  so  that  current  data  in 
the  cache  is  consistent  with  sequential 
instruction  execution.  Only  current  data  is  copied 
back  to  the  main  memory. 

Let instruction qi cause an interrupt.
Preceding  instructions  complete  and  are 
compressed  out  of  the  IU.  When  a  store 
instruction  is  compressed  out,  the  datum  it  wrote 
is  marked  current.  When  qi  reaches  Slot0  in  the 
IU,  a  pending  item  remaining  in  the  cache  is 
overwritten  with  the  current  item  in  its  data 
couple.  Current  data  may  be  copied  back  to  the 
main  memory  if  necessary.  Main  memory  (and 
the  cache)  is  now  in  a  state  consistent  with 
sequential instruction execution up to qi.
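
The Current/Pending bookkeeping of a data couple can be sketched as follows. This is a minimal model under assumptions: the class name `DataCouple` and its method names are hypothetical, a single item stands in for a whole cache line, and the treatment of loads is not modelled.

```python
class DataCouple:
    """Two copies of one memory item: one Current, one Pending."""

    def __init__(self, value):
        self.items = [value, value]     # a fetched line fills both lines
        self.current = 0                # index of the Current datum

    def store(self, value):
        # A store instruction overwrites the Pending datum only.
        self.items[1 - self.current] = value

    def commit(self):
        # The store is compressed out of the IU: the two statuses swap.
        self.current = 1 - self.current

    def squash(self):
        # An interrupt reached Slot0: Pending is overwritten with Current.
        self.items[1 - self.current] = self.items[self.current]

    def read_current(self):
        # Only the Current datum may be copied back to main memory.
        return self.items[self.current]


couple = DataCouple(5)
couple.store(7)
couple.commit()                 # 7 becomes Current
couple.store(9)                 # speculative store, still Pending
couple.squash()                 # interrupt: 9 is discarded
```

After the squash both items hold 7, so the cache again reflects only stores that were compressed out in instruction stream order.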

3.3  Comparisons  with  Previous  Work 

Smith  and  Pleszkun  have  presented  solutions 
(i.e., Reorder Buffer, History Buffer, and Future
File) to the precise interrupt problem for systems
in which at most one instruction may issue
(in-order) and complete (possibly out-of-order) each
cycle and store instructions are executed in-order
[7]. The FDS does not have these restrictions on
issuances  and  completions.  It  has,  in  effect,  an 
integrated  reorder  buffer  that  may  update 
architected  registers  with  multiple  results  each 
cycle,  causing  them  to  "skip"  some  conventional 
machine  states  that  are  unnecessary  for  them  to 
assume.  We  avoid  a  potential  bottleneck  in  the 
FDS  design  by  not  routing  results  through 
instruction  issuing  logic  as  in  Sohi’s  RUU  [8]. 

Another  approach,  supporting  a  model  of 
execution  similar  to  Smith  and  Pleszkun’s,  is 
checkpoint  repair  [3].  A  minimum  of  3  sets  of 
registers  are  used  to  save  and  restore  state.  A 
tradeoff  must  be  made  between  the  frequency 
with  which  state  is  saved  and  the  amount  of 
useful  results  that  may  be  discarded  and 
recalculated  upon  an  exception.  The  scheme  may 
cause  instruction  issuing  to  stall  under  certain 
circumstances. 


4.  Instruction  Squashing 

As  an  uncompleted  branch  instruction  is 
compressed  upward  in  the  IU,  the  number  of 
instructions  which  can  be  issued  becomes  smaller, 
decreasing  throughput.  In  this  section,  we  present 
an  instruction  squashing  scheme  that  facilitates 
the  use  of  branch  prediction  techniques  in  the 
FDS. 

A branch instruction, qB, transfers control to
qB+1, or to an out-of-sequence branch target
instruction. The branch target is not known until
qB executes. Since qB+1 is often fetched before qB
executes, a transfer of control to qB+1 usually
causes little or no processing delay. Processing is
likely to be delayed if control is transferred to an
out-of-sequence branch target that is fetched after
qB executes.

Branch  prediction  schemes  attempt  to  reduce 
processing  delays  by  predicting  and  fetching  the 
branch  target  instruction  before  the  branch  is 
executed.  Branch  prediction  techniques  have  been 
presented by others [5][6]. A prediction accuracy
of  about  80%  to  98%  is  achieved  depending  on  the 
nature  of  the  computation  and  the  technique 
employed. 

A  key  issue  associated  with  using  branch 
prediction  in  a  processor  that  may  issue  multiple, 
out-of-order  instructions  is  the  expeditious 
squashing  of  instructions  executed  on  an 
incorrectly  predicted  path.  This  is  more  difficult 
than  in  a  conventional  machine  because 
instructions  preceding  and  following  a  predicted 
branch  instruction  may  coexist  in  the  issuing 
mechanism  and  may  execute  concurrently  and 
out-of-order  before  the  branch  outcome  is  known. 

When  the  Buffer  Unit  detects  a  branch 
instruction,  qB,  in  a  fetch  block,  an  algorithm  is 
used  to  predict  the  outcome  of  the  branch.  If  the 
branch  is  predicted  to  be  taken,  the  target 
instruction  and  instructions  following  it  are 
fetched  and  transferred  to  the  IU;  otherwise,  the 
fetching  of  instructions  continues  on  the  present 
path.  The  BU  saves  qB,  its  tag,  and  its  predicted 
outcome  and  forwards  instruction  qB  to  the  IU. 

When  dependencies  allow,  the  IU  issues  qB  by 
placing  its  tag  on  the  Tag  Bus  (Figure  7).  The  BU 
and  the  data  units  monitor  the  Tag  Bus  at  this 
time. The BU executes qB when its tag matches
that on the Bus and compares its outcome with its
predicted outcome. If the predicted outcome is
correct, the Buffer Unit places qB's tag on the Tag
Bus just as functional units do for completed
instructions. The branch instruction is then
marked complete in the IU. If qB's predicted
outcome is incorrect, the BU places qB's tag on
the Interrupt Bus and invalidates instructions in
its buffer.
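
The decision the Buffer Unit makes after executing a branch can be summarized in a short sketch. The function name `resolve_branch`, the dictionary keys, and the reduction of a branch outcome to taken or not taken are assumptions made for illustration only.

```python
def resolve_branch(tag, predicted_taken, actual_taken):
    """Return what the Buffer Unit drives after executing a branch."""
    if predicted_taken == actual_taken:
        # Correct prediction: report completion like any functional unit.
        return {"tag_bus": tag, "interrupt_bus": 0, "invalidate_buffer": False}
    # Misprediction: no completion is reported; a branch prediction
    # interrupt is raised and the prefetched instructions are dropped.
    return {"tag_bus": 0, "interrupt_bus": tag, "invalidate_buffer": True}


ok = resolve_branch(0b0100, predicted_taken=True, actual_taken=True)
bad = resolve_branch(0b0100, predicted_taken=True, actual_taken=False)
```

A mispredicted branch is therefore handled by the same Interrupt Bus path as any other interrupt-causing instruction.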

Assume  that  qB  has  executed  and  is  found  to 
be incorrectly predicted. Since qB's tag is not
asserted  on  the  Tag  Bus,  it  is  not  marked 
complete  in  the  IU  and  so  it  eventually  occupies 
Slot0  in  the  IU  Stack.  Recall  that  the  tag  of  the 
instruction  in  Slot0  is  compared  with  tags  on  the 
Interrupt  Bus.  Slot0  detects  a  tag  match  with  a 
branch  instruction  and  knows  that  this  is  a 
branch  prediction  interrupt.  The  contents  of  the 
A-Regs  are  transferred  to  the  W-Regs  and  all 
instructions  in  the  IU  are  invalidated.  After  the 
transfer,  the  A-Regs  and  the  W-Regs  are  in  the 
state  that  they  would  have  in  a  conventional 
machine  that  executed  instructions  up  to  but  not 
including qB.

An  instruction  that  writes  to  the  destination 
register  of  a  preceding  completed  instruction  in 
the  IU  is  not  issued.  This  dependency  control  is 
part  of  the  precise  interrupt  scheme  adopted 
(Section  3)  and  prevents  the  overwriting  of  a 
W-Reg  before  its  contents  are  transferred  to  the 
corresponding  A-Reg. 

Table 1: Instruction execution times.

Instruction Type    Base Machine    FDS
Store               1 Cycle         2 Cycles
Load                2 Cycles        4 Cycles
Branch              2 Cycles        3 Cycles
Integer             1 Cycle         1 Cycle
Fl. Pt.             1 Cycle         1 Cycle

The  instruction  squashing  capability 
presented  supports  the  multiple,  out-of-order 
issue  of  instructions  preceding  and  following 
multiple  predicted  conditional  branch 
instructions.  Useful  work  is  not  undone  in  the 
process.  The  transfer  of  A-Regs  to  W-Regs  occurs 
in  one  cycle  when  qB  reaches  Slot0.  The  effects  of 
the  execution  of  instructions  that  followed  qB  are 
thus  eliminated  in  one  cycle.  Cycles  expended 
while qB moves to Slot0 are productive because
useful instructions preceding qB are executing.
These  instructions  are  issued  in  multiples  and 
out-of-order  as  dependencies  allow. 

5.  Measurements 

We  use  14  Livermore  Loops  and  the 
Dhrystone  benchmarks  to  study  the  FDS 
behavior.  Two  traces  of  the  14  Livermore  Loop 
benchmarks  are  used,  LL_16  and  LL_32,  for  16 
and  32  register  CPUs  respectively.  The  Dhrystone 
benchmark  trace  is  for  a  32-register  CPU. 

Throughput  comparisons  are  made  with  a 
pipelined  "Base  Machine",  which  issues  at  most 
one  instruction  per  cycle,  in  order,  to  one  FU.  The 
14  Livermore  Loops  throughput  is  the  harmonic 
mean  of  the  individual  loop  throughputs. 
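
For reference, the harmonic mean used to combine the loop throughputs can be computed as below. The per-loop values are hypothetical, and throughput is assumed to be measured in instructions per cycle. The harmonic mean equals the overall rate when every loop contributes the same number of instructions.

```python
from statistics import harmonic_mean

loop_ipc = [2.0, 1.0, 0.5, 1.0]         # hypothetical per-loop throughputs
overall = harmonic_mean(loop_ipc)

# The same value from the definition: n divided by the sum of reciprocals.
by_hand = len(loop_ipc) / sum(1.0 / x for x in loop_ipc)
```

A single slow loop pulls the harmonic mean down more strongly than it would an arithmetic mean, which is 1.125 for these values.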

Table 2: Benchmark throughputs on FDS
systems with register renaming, precise
interrupts and branch prediction over a range of
branch prediction accuracies (P.A.s).

                        Issue Unit Stack Size
Benchmark    P.A.     4       8       12      16      32
LL_16        50%      0.91    1.32    1.60    1.73    1.91
             85%      0.92    1.36    1.62    1.78    1.95
             100%     0.92    1.39    1.63    1.80    1.97
LL_32        50%      0.91    1.32    1.60    1.73    1.94
             85%      0.92    1.36    1.62    1.78    1.98
             100%     0.92    1.39    1.63    1.80    2.00
Dhrystone    50%      0.70    0.91    1.00    1.04    1.05
             85%      0.71    1.04    1.15    1.19    1.21
             100%     0.72    1.08    1.19    1.22    1.24

Base Machine: LL_16 0.72, LL_32 0.72, Dhrystone 0.65
Base+BP Machine (100% Prediction Accuracy): LL_16 0.79, LL_32 0.79, Dhrystone 0.78

Instruction execution times of the Base
Machine and the FDS are given in Table 1. They
include the time necessary to eliminate the
dependencies an instruction may inflict on
following instructions.

Figure 9: Dhrystone benchmark throughputs on
FDS systems with register renaming.
[Plot not reproduced. Horizontal axis: Issue Unit
Stack Size. Curves: 100%, 90%, 85%, 80%, and 50% PA,
and a system with no branch prediction, no precise
interrupts, and total compression.]

Benchmark  throughputs  on  FDS  systems 
with  register  renaming,  precise  interrupts,  and 
branch  prediction  mechanisms  with  various 
prediction  accuracies  (PAs)  are  given  in  Table  2. 
These  systems  have  unlimited  numbers  of  FUs. 
Included  in  Table  2  are  throughputs  measured  on 
the  Base  Machine  (Base)  and  on  the  Base 
Machine  with  100%  PA  (Base+BP).  These 
machines  are  identical  except  for  their  branch 
instruction  execution  times. 


The Base+BP machine performance on the
Dhrystone benchmark (0.78) surpasses that on
a FDS system with 4 slots and a 100% PA (0.72).
In this instance, a conventional machine with a
shorter branch instruction execution time has
higher throughput than the FDS.

Figure 9 includes a plot of a FDS system with
total compression and register renaming but
without precise interrupts. Recall that total
compression removes all completed instructions
from the IU each cycle. We see that increases in
Dhrystone throughput from speculative executions
more than compensate for the use of top
compression and the instruction dependency
imposed by the squashing capability.

Figure 10: LL_32 benchmark throughputs on
FDS systems with register renaming.

Figure 11: Throughputs on FDS systems with
register renaming, 2 integer, 2 Fl. Pt., and 2 data
units, a fetch block size of 4, and 4 IU ports with
a 85% PA.
[Plot not reproduced. Curves: LL_16, LL_32, and
Dhrystone, each with and without renaming.]


Table 3: Percent decrease in LL_16 throughput
on a FDS system with precise interrupts
compared to a FDS with total compression.
Register renaming is denoted by RR.

           Issue Unit Stack Size
           4       8       12      16      32
no RR      10.2    15.3    13.2    13.8    14.3
RR         9.2     17.5    14.4    14.4    12.3

LL_32 throughputs on systems with PAs of
50%, 85% and 100% are plotted in Figure 10.
Here we plot a FDS system with total
compression and register renaming but without
precise interrupts. We see that for the LL_32
benchmark, the benefit of speculative executions
does not compensate for the negative effects of
the squashing scheme adopted.

Throughputs  on  FDS  systems  with  limited 
resources  are  plotted  in  Figure  11.  These 
systems have 2 integer and 2 Fl. Pt. FUs,
2  data  units,  a  fetch  block  size  of  4,  and  4  IU 
ports.  The  benefit  of  register  renaming  is 
apparent  in  this  plot. 

The  cost  of  fast,  precise  interrupts  in  a  FDS 
system  may  be  expressed  as  a  decrease  in 
throughput  compared  to  a  system  with  total 
compression.  We  see  (Table  3)  that  it  is  less 
than  15%  in  systems  with  12  or  more  IU  slots. 
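
The 15% bound can be checked against the Table 3 entries with a few lines of Python. The numbers below are the table values as read from the source; the helper `percent_decrease` and the throughputs passed to it in the comment are illustrative assumptions, since the underlying total-compression throughputs are not listed here.

```python
slots = [4, 8, 12, 16, 32]
decrease = {                            # percent decrease, from Table 3
    "no RR": [10.2, 15.3, 13.2, 13.8, 14.3],
    "RR":    [9.2, 17.5, 14.4, 14.4, 12.3],
}


def percent_decrease(total_compression, precise):
    """Throughput loss of the precise-interrupt system, in percent."""
    return 100.0 * (total_compression - precise) / total_compression


# Largest loss among systems with 12 or more IU slots.
worst_12_plus = max(v for row in decrease.values()
                    for s, v in zip(slots, row) if s >= 12)
# e.g. percent_decrease(2.0, 1.7) is about 15 (hypothetical throughputs)
```

The 8-slot systems exceed the bound (15.3% and 17.5%), which is why the statement is restricted to 12 or more slots.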

6.  Conclusions 

We have presented a mechanism, the Fast
Dispatch Stack (FDS), which performs in an
integrated fashion the following tasks,
indispensable for a superscalar processor:

a.  the  detection  and  dispatching  of  multiple 
instructions,  possibly  out-of-order,  to 
available  functional  units; 

b.  the  implementation  of  fast,  precise 
interrupts; 

c.  the  implementation  of  a  "squashing" 
capability  so  that  speculative  instruction 
execution  along  predicted  paths  can  be 
undertaken  without  attendant 
performance  penalty. 

We  evaluated  the  design  trade-offs  and  the 
performance  of  the  resulting  superscalar 
processor  with  extensive  simulations.  The  results 
are  presented. 

We  expect  that  the  FDS  we  have  developed 
can  be  extended  to  process  established  complex 
instruction  sets,  such  as  DEC  Vax  780,  IBM 
System  390,  and  Intel  x86.  Furthermore,  we 
expect  to  study  the  interaction  between  compilers 
and  the  FDS  and  to  study  its  behavior  on  other 
benchmarks. 


Acknowledgements 

The  research  reported  herein  has  been 
supported  in  part  by  the  Joint  Services 
Electronics Program, Contract Number
F49620-90-C-0039.


References 

[1]  R.D. Acosta, J. Kjelstrup, and H.C. Torng,
"An Instruction Issuing Approach to
Enhancing Performance in Multiple
Functional Unit Processors". IEEE
Transactions on Computers, Vol. C-35, No.
9, Sept. 1986, pp. 815-828.

[2]  H.  Dwyer,  "A  Multiple,  Out-of-Order, 
Instruction  Issuing  System  for 
Superscalar  Processors",  Ph.D.  Thesis, 
School  of  Electrical  Engineering,  Cornell 
University,  1991. 

[3]  W.W.  Hwu  and  Y.N.  Patt,  "Checkpoint 
Repair  for  High-Performance  Out-of-Order 
Execution  Machines".  IEEE  Transactions 
on  Computers,  Vol.  C-36,  No.  12,  Dec. 
1987,  pp.  1496-1514. 

[4]  R.M.  Keller,  "Look-Ahead  Processors". 
Computing  Surveys,  Vol.  7,  No.  4,  Dec. 
1975,  pp.  177-195. 

[5]  J.K.F.  Lee  and  A.J.  Smith,  "Branch 
Prediction  Strategies  and  Branch  Target 
Buffer  Design".  IEEE  Computer,  Jan. 
1984,  pp.  6-22. 

[6]  S.  McFarling  and  J.  Hennessy,  "Reducing 
the  Cost  of  Branches".  Proceedings,  13th 
Annual  Symposium  on  Computer 
Architecture,  June  1986,  pp.  396-404. 

[7]  J.E.  Smith  and  A.R.  Pleszkun, 
"Implementation  of  Precise  Interrupts  in 
Pipelined  Processors".  IEEE  Transactions 
on  Computers,  Vol.  37,  No.  5,  May  1988, 
pp.  562-573. 

[8]  G.S.  Sohi,  "Instruction  Issue  Logic  for 
High-Performance,  Interruptable,  Multiple 
Functional  Unit,  Pipelined  Computers". 
IEEE  Transactions  on  Computers,  Vol.  39, 
No.  3,  Mar.  1990,  pp.  349-359. 

[9]  G.S. Tjaden and M.J. Flynn, "Detection
and Parallel Execution of Independent
Instructions". IEEE Transactions on
Computers, Vol. C-19, Oct. 1970, pp. 889-895.




1993  International  Phoenix  Conference  on 
Computers  and  Communications 




ON  INSTRUCTION  WINDOWING  FOR  FINE  GRAIN 
PARALLELISM  IN  HIGH-PERFORMANCE  PROCESSORS 


H.  C.  TORNG,  School  of  Electrical  Engineering,  Cornell  University 
Ithaca,  New  York  14853 

H.  DWYER,  IBM/Austin,  and  D.  MARR,  University  of  Michigan, 
formerly  at  Cornell  University 


ABSTRACT  Fine  grain  parallelism  is  an 
effective  approach  to  enhancing  processor 
performance  through  multiple  and  possibly  out 
of  order  instruction  issue  and  execution.  We 
define, design and evaluate a basic central
window, which works with a dynamic
instruction stream. Several schemes are
presented  to  reduce  the  window's  potential 
impact  on  processor  cycle  time  and  its 
hardware  cost.  Finally,  we  show  that  a 
central  window  can  function  effectively  as  a 
buffer  for  speculative  execution  and  for 
handling  interrupts  and  exceptions. 


I.  INTRODUCTION 

The  drive  to  enhance  processor  performance 
and  the  advances  in  device  technology  have 
led  designers  to  explore  various  instruction 
issuing  schemes. 

Most processors, including 360/91 [1] and CRAY
machines  [2],  examine  one  instruction  at  a  time. 
If  that  instruction  is  free  of  data  and  resource 
dependencies,  it  is  issued;  otherwise,  the  issue 
process  stops  until  the  relevant  dependencies 
have  been  resolved.  Consequently,  at  most  one 
instruction  is  issued  per  cycle. 

To enhance processor performance, designers
have been pursuing, among other things, fine
grain parallelism in instruction issuance; this
entails the following:

1. issue more than one instruction per
machine cycle (multiple instruction issuance);

2. issue instructions out of program
sequence (out-of-order instruction issuance).

In  order  to  realize  the  intrinsic  potential  for 
multiple  and  out-of-order  instruction  issuance 
for  a  given  set  of  programs,  the  designers  have 
to  endow  processors  with  the  capacity  to 
detect  execution  concurrencies  that  exist  among 
instructions  in  an  instruction  stream. 
Specifically,  instructions  in  an  instruction 
stream  can  be  issued  concurrently  and/or  out- 
of-order,  if  it  can  be  established  that  there 
exist  no  hazards  [3, 4]  due  to: 

1.  resource  conflicts; 

2.  data  dependences; 

3.  control  dependences. 

The  detection  of  instructions  for  multiple  and 
out-of-order  issuances  is  an  important  task  in 
the design of high-performance processors; we
address it in this paper.

Section II briefly discusses static means to
achieve multiple and out-of-order instruction
issuance. Section III introduces the instruction
windowing mechanism, which extracts
instructions for concurrent issuance. The basic
implementation issues are explored in Section
IV. In Section V, we present modifications to
the basic realization schemes. The use of an
instruction window to support speculative
execution is outlined in Section VI. Section VII
presents two window-aided approaches to
handling interrupts. Concluding remarks are
presented in Section VIII.


II.  COMPILERS FOR FINE GRAIN
PARALLELISM

The  extraction  of  instructions  that  can  be 
issued  concurrently  can  be  performed 
statically  at  compile  time.  It  may  come  under 
the name of "program restructuring" [5]. In the
"very  long  instruction  word"  (VLIW) 
approach,  possible  concurrent  operations  are 
identified  and  packed  into  instructions.  It  is 
advantageous  to  put  as  many  operations  into 
an  instruction  as  possible.  In  order  to  specify 
many  operations  with  one  instruction,  an 
instruction  will  have  many  fields  and  thus 
become  very  long  [6, 7]. 

At compile time, the "scope" (the number of
instructions that can be examined at the same
time) is relatively large. One technique that
exploits it is Trace Scheduling [8]. A trace represents a
path,  which  may  encompass  several  basic 
blocks.  Instructions  in  these  blocks  can  be 
moved  and  packed  into  very  long  instruction 
words.  Since  there  are  many  traces  possible  for 
a  given  program,  only  those  with  "high" 
probabilities  of  occurrence  are  processed.  In 
executing a trace, we run the risk that it may
have to be aborted because one of the
conditional branches does not follow the path
assumed by the trace. Provisions, which may be
costly,  have  to  be  made  in  the  specific  trace  to 
ensure  semantic  correctness  for  the  program. 

Various  schemes  for  constructing  VLIW  have 
been reported; see for example [9, 10]. It is
difficult  for  machines  which  rely  entirely  on 
compiler  technology  to  extract  fine  grain 
parallelism  due  to  variable  memory  latencies 
and  other  variable  delays.  In  addition, 
dynamic  branch  prediction  is  frequently  more 
accurate  than  static  branch  prediction. 
Although  it  is  essential  to  have  good 
optimizing  compilers,  it  is  equally  important 
to  have  good  dynamic  instruction  scheduling. 


III.  INSTRUCTION  WINDOWING 

Due  to  resource  and  data  considerations,  we 
cannot  expect  to  extract  all  the  concurrencies 
with  static  means.  Dynamic  scheduling  must 
be examined. Some recent machines [11, 12]
examine  and  issue  multiple  instructions 
concurrently  per  machine  cycle,  limited  by  the 
order  and  mix  of  instructions  in  the  dynamic 
instruction  stream.  Multiple  instructions  are 
issued  only  when  3  or  4  consecutive  instructions 
are  of  specific  types;  such  restrictions  severely 
limit  the  utilization  of  the  hardware 
resources  of  the  processor,  and  degrade  its 
performance. 

Interesting  studies  on  dynamic  instruction 
scheduling  have  been  actively  pursued;  see  for 
example  [13]. 

In  a  classical  processor,  only  the  instruction  at 
the  head  of  an  instruction  stream  is  examined 
by  the  instruction  unit.  It  can  be  said  that  it 
constitutes  a  window  with  a  size  of  one 
instruction.  In  order  to  extract  fine  grain 
parallelism from a dynamic instruction stream,
we  believe  that  we  have  to  endow  processors 
with  the  ability  to  look  at  more  instructions 
at  any  given  time;  thus  the  concept  of 
"windowing".  A  processor,  through  an 
instruction  window,  can  extract  multiple,  and 
possibly  out  of  order,  instructions  for  issuance. 

It  can  be  said  that  processors  have  always 
had  an  instruction  window.  We  simply 
propose  that  the  size  of  the  window  be 
increased  from  one  to  an  integer  much  greater 
than  one.  A  processor,  being  able  to  see  more, 
should be able to "do" more.


IV.  WINDOW IMPLEMENTATION

The  general  organization  of  a  processor  is 
shown  in  Fig.  1  [14].  Note  the  presence  of 
multiple  functional  units  in  a  processor.  As 
technology allows us to fabricate chips with an
increasing number of transistors,
adding multiple functional units becomes
increasingly commonplace. A basic instruction
window, called a Dispatch Stack (DS), is
shown in Fig. 2 [14]; it contains essentially a
stack of registers (slots), with each housing
one instruction.

The DS contains instructions which are either
waiting to be issued, being executed, or
waiting to be removed after execution
completion.

Fig. 1. General Organization of a Processor

Fig. 2. A Basic Instruction Window

When an instruction completes, the instructions
below are moved up to fill the vacated slots,
creating space at the bottom for new
instructions to be brought in from an
Instruction Cache or an Instruction Buffer
Unit.


Instructions are selected for issue from the DS
with the instructions toward the top having
higher priority than those below.

In its most elementary implementation,
instructions are brought into the DS when
conditional branches have been resolved. An
unresolved branch may halt the supply of
instructions to the DS.

Data  dependencies  among  instructions  in  the 
DS  are  kept  with  "counters"  for  each 
instruction  [14].  Combinational  logic  circuits 
are used to update the counters. If no
dependencies exist for an instruction (all of its
associated counters contain 0), it is considered
independent and can be issued. Several
instructions may be issued concurrently; the DS
directs the routing of selected instructions to
available functional units.
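
One way to picture the counter-based check is the sketch below. It is not the circuit of [14]: it assumes a single counter per instruction that counts the earlier, uncompleted instructions it must wait for, and the function name `dependency_counts` and the register names are hypothetical.

```python
def dependency_counts(window):
    """window: list of (dest, sources) pairs, oldest instruction first.

    Returns, for each instruction, the number of earlier instructions in
    the window that it depends on.
    """
    counts = []
    for j, (dest_j, srcs_j) in enumerate(window):
        n = 0
        for dest_i, srcs_i in window[:j]:
            raw = dest_i in srcs_j      # reads a value not yet written
            waw = dest_i == dest_j      # writes the same register
            war = dest_j in srcs_i      # overwrites an operand still needed
            if raw or waw or war:
                n += 1
        counts.append(n)
    return counts


window = [
    ("R1", {"R2", "R3"}),   # R1 <- R2 op R3
    ("R4", {"R1", "R5"}),   # needs R1 from the first instruction
    ("R6", {"R7", "R8"}),   # independent of everything above it
    ("R2", {"R6", "R9"}),   # needs R6 and would overwrite R2 too early
]
counts = dependency_counts(window)
issuable = [k for k, c in enumerate(counts) if c == 0]
```

Here the first and third instructions have zero counts and could be issued together, out of order with respect to the second.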

In  evaluating  such  a  window,  we  pay 
attention  to  several  items: 

the implementation cost: how much
chip area would this mechanism consume?
This concern has been lessened due to advances
in chip technology;

the performance cost: since
circuits are needed to update the dependence
counts, issue independent instructions, remove
completed instructions, and finally bring in
new instructions, these circuits may have an
adverse effect on the clock rate. To put it
differently: if we have to increase the
processor cycle time, the performance gain due
to multiple instruction issue may be nullified.
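
The trade-off between issue rate and cycle time can be made concrete with a small calculation. The function name `relative_performance` and all numbers are hypothetical; performance is taken as instructions per cycle divided by cycle time.

```python
def relative_performance(ipc_new, ipc_old, cycle_new, cycle_old):
    """Ratio of instruction rates (instructions per unit time), new to old."""
    return (ipc_new / cycle_new) / (ipc_old / cycle_old)


# A window that raises throughput from 1.0 to 1.5 instructions per cycle
# but stretches the cycle time by 50% gives no net gain.
no_gain = relative_performance(1.5, 1.0, 1.5, 1.0)

# With only a 20% longer cycle, the same window is 25% faster overall.
net_gain = relative_performance(1.5, 1.0, 1.2, 1.0)
```

The break-even point is reached when the cycle time grows by the same factor as the instructions issued per cycle.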

V.  NEW  IMPLEMENTATION  SCHEMES 

To  address  the  concerns  on  its  potential 
adverse  impact  on  cycle  time,  we  have 
developed several new schemes: the use of bit
vectors [15, 16]; the use of pointers; and
finally a block-based window [17].

Bit  Vector 

In using bit vectors, each instruction is
represented with an "I-Group", which consists
of a tag, a type vector, the instruction itself, a
read-vector, and a write-vector; one such
group is shown in Fig. 3.

A tag is a bit vector with one and only one bit
set to 1; each tag identifies an instruction
uniquely.

Fig. 3. An Illustrative I-Group

The  type  vector  specifies  an  instruction's  type 
with  one  position  for  each  type. 

In  the  read-vector,  each  bit  is  assigned 
exclusively  to  an  architected  register.  A  bit  in 
the  vector  is  set  to  one  if  and  only  if  its 
corresponding  register  provides  an  operand  for 
the  intended  operation.  The  write-vector  can 
be  explained  similarly. 

The  formulation  of  the  I-Groups  facilitates 
the  detection  of  data  dependencies  among 
instructions,  the  removal  of  completed 
instructions,  and  the  dispatching  of 
instructions  to  functional  units. 
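
How the vectors make dependency detection cheap can be sketched as follows. The class `IGroup`, the helper `regs`, and the example instructions are assumptions for illustration; the point is only that each hazard test reduces to a bitwise AND of two vectors.

```python
from dataclasses import dataclass


@dataclass
class IGroup:
    tag: int            # one-hot: identifies the instruction
    type_vec: int       # one-hot: instruction type
    instruction: str    # the instruction itself
    read_vec: int       # bit i set if Ri supplies an operand
    write_vec: int      # bit i set if Ri is the destination


def regs(*indices):
    """Build a bit vector with the given register positions set."""
    v = 0
    for i in indices:
        v |= 1 << i
    return v


def depends_on(later, earlier):
    raw = later.read_vec & earlier.write_vec    # read after write
    waw = later.write_vec & earlier.write_vec   # write after write
    war = later.write_vec & earlier.read_vec    # write after read
    return bool(raw | waw | war)


add = IGroup(0b001, 0b01, "add R1,R2,R3", regs(2, 3), regs(1))
sub = IGroup(0b010, 0b01, "sub R4,R1,R5", regs(1, 5), regs(4))
mul = IGroup(0b100, 0b10, "mul R6,R7,R8", regs(7, 8), regs(6))
```

With these definitions `depends_on(sub, add)` is true because `sub` reads R1, while `mul` is independent of both earlier instructions.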

For  an  8-slot  DS  with  a  set  of  16  architected 
registers  and  4  instruction  dispatching  ports, 
Dwyer  [16]  performed  a  detailed  design  and 
found  that  the  critical  path  imposes  a  delay 
of  16  gates. 

Pointers 

Instead  of  shifting  instructions  in  the  DS  to 
maintain  the  proper  order  of  appearances 
among  them,  Marr  [17]  proposed  that  two 
pointers, head and tail, be used. This brings
about a considerable reduction in circuit
complexity for the DS.

There are, however, some complications. One of
these is that the head pointer can be moved if
and only if a contiguous set of instructions,
including the top one, have been completed;
this is termed "top compression" by Dwyer,
and the performance degradation due to this
restriction is not significant [16]. In addition to
accommodating  the  installation  of  a  head 
pointer,  top  compression  brings  with  it 
additional  advantages,  which  will  be 
discussed  in  Sections  VI  and  VII. 

The  tail  pointer  indicates  the  last  instruction 
entered  into  the  window.  Every  time  new 
instructions  are  placed  into  the  window,  the 
tail pointer is moved down. If the window is
full,  then  the  tail  pointer  must  wait  until  the 
head  pointer  moves  to  give  room  to  place  new 
instructions. 

The  result  of  this  is  that  the  window  can 
become  "fragmented",  meaning  that  although 
there  may  be  completed  instructions  in  the 
window,  new  instructions  cannot  be  placed  into 
the  window  until  those  at  the  head  are 
retired. 
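
The pointer scheme and the fragmentation it can cause are easy to reproduce in a small model. The class below is a sketch under assumptions: the window is treated as a circular buffer, the tail is derived from the head and an occupancy count, and all names are hypothetical.

```python
class PointerWindow:
    def __init__(self, size):
        self.size = size
        self.done = [False] * size      # completion flag per slot
        self.head = 0                   # slot of the oldest instruction
        self.count = 0                  # occupied slots; tail = head + count

    def insert(self):
        """Place one instruction at the tail; return its slot, or None if full."""
        if self.count == self.size:
            return None
        slot = (self.head + self.count) % self.size
        self.done[slot] = False
        self.count += 1
        return slot

    def complete(self, slot):
        self.done[slot] = True

    def compress_top(self):
        """Advance head over the contiguous completed run; return how many retired."""
        retired = 0
        while self.count and self.done[self.head]:
            self.head = (self.head + 1) % self.size
            self.count -= 1
            retired += 1
        return retired


w = PointerWindow(4)
slots_used = [w.insert() for _ in range(4)]
w.complete(1)
w.complete(2)
stuck = w.compress_top()        # head instruction not done: nothing retires
refused = w.insert()            # window is fragmented, so this fails
w.complete(0)
freed = w.compress_top()        # slots 0, 1 and 2 retire together
```

The two completed instructions in slots 1 and 2 occupy space until the head instruction finishes, which is the fragmentation described above.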

Organizing  Slots  into  Blocks 

To  reduce  circuit  complexity  for  the  DS  even 
more,  Marr  [17]  proposed  that  a  fixed  number, 
which  can  be  1,  2,  4, ...,  of  contiguous  slots  be 
organized  into  blocks.  The  head  pointer  is 
moved  only  when  the  instructions  in  the  top 
block  are  completed.  Similarly,  the  tail 
pointer  is  moved  when  new  instructions  can  be 
brought  into  the  window.  The  advantage  of 
organizing  instruction  slots  into  blocks  is  that 
being  able  to  do  things  a  "block"  at  a  time 
enables  considerable  saving  in  instruction 
fetch,  dependency  evaluation,  instruction 
issue  and  replacements. 
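
A minimal sketch of the block organization, under the assumption that a block is fetched and retired as a unit: the name BlockWindow and its methods are hypothetical and are not taken from [17]. The head moves only when every instruction in the top block has completed.

```python
from collections import deque


class BlockWindow:
    """Hypothetical window whose slots are grouped into fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.blocks = deque()        # oldest block first; each maps name -> completed

    def fetch_block(self, names):
        """Bring in one whole block at the tail; refuse if no block is free."""
        if len(names) != self.block_size:
            raise ValueError("a block must be filled completely")
        if len(self.blocks) == self.num_blocks:
            return False
        self.blocks.append({name: False for name in names})
        return True

    def complete(self, name):
        """Mark one instruction as completed, wherever its block sits."""
        for block in self.blocks:
            if name in block:
                block[name] = True
                return
        raise KeyError(name)

    def retire_block(self):
        """Move the head only when every instruction in the top block is done."""
        if self.blocks and all(self.blocks[0].values()):
            return list(self.blocks.popleft())
        return []


bw = BlockWindow(num_blocks=2, block_size=2)
bw.fetch_block(["a", "b"])
bw.fetch_block(["c", "d"])
refused = bw.fetch_block(["e", "f"])   # window full: tail cannot move
for name in ["a", "c", "d"]:
    bw.complete(name)
stalled = bw.retire_block()            # top block still waits for b
bw.complete("b")
first = bw.retire_block()
second = bw.retire_block()
```

The coarser granularity means the second block, although complete earlier, leaves the window only after the top block does.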

The  organizational  diagram  of  a  block-based 
instruction  window  is  shown  in  Fig.  4.  A  new 
window  organization,  incorporating  the 
pointer  and  block  concepts,  can  be  found  in  [17].

Fig.  4.  A  Block-based  Instruction  Window

VI.  CONDITIONAL  BRANCH  HANDLING 

One  of  the  vexing  problems  in  processor  design 
is  the  handling  of  conditional  branches.  For 
processors  employing  instruction  windowing, 
the  simple  fact  is  that  we  want  a  large  number 
of  instructions  in  the  window  to  provide  more 
instruction  level  parallelism.  Conditional 
branches  often  make  it  difficult  to  keep  the 
window  full. 


Branch prediction techniques have been
developed [18, 19] to fetch and to execute
speculatively along a likely path. Even
though the achieved prediction accuracy can
reach 80% to 98%, some guesses will prove to
be wrong. When a branch is incorrectly
predicted, changes in machine state need to be
removed. Doing so may involve a penalty.
Instruction windowing can make an important
contribution in this area [15, 16].

Conditional branch instructions, along with
those instructions on the predicted branch
path, are brought into the instruction window
to be executed speculatively. The execution
results are written into an additional set of
registers, called the "working registers", for
temporary storage and access by subsequent
instructions. These results are copied into the
"architected registers" once the instructions
are retired from the window.

We now require that a new instruction
removal mode be instituted: only those
instructions at the top of the window may be
retired; multiple instructions are retired at
once if they form a contiguous sequence of
instructions at the top of the window. This
instruction retirement mode is called top
compression [16].

When the prediction made for a conditional
branch is found to be correct, it can be removed
from the instruction window when it is
included in a contiguous segment of completed
instructions, including the top one in the
window. Again, note that when an instruction
is retired from the window, its result is made
permanent by being copied into an architected
register.

When the prediction made for a conditional
branch is found to be incorrect, it will not be
retired from the instruction window. All
instructions which follow the branch are
removed from the window. When the branch
is retired from the window, the contents of the
architected registers are copied into the
working registers. The instruction window is
then filled with instructions from the correct
path. Fig. 5 provides a schematic diagram.
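
The interplay between working registers, architected registers, and a mispredicted branch can be sketched as follows. This is a simplified model under stated assumptions: the class SpeculativeCore, its method names, and the representation of an instruction as a destination register plus a precomputed value are hypothetical, chosen only to make the register copying visible.

```python
class SpeculativeCore:
    """Hypothetical model: speculative results go to working registers and
    become permanent in architected registers only at retirement."""

    def __init__(self, registers):
        self.architected = dict(registers)
        self.working = dict(registers)
        self.window = []             # oldest instruction first

    def issue(self, dest=None, value=None, is_branch=False):
        entry = {"dest": dest, "value": value, "is_branch": is_branch,
                 "done": False, "mispredicted": False}
        self.window.append(entry)
        return entry

    def execute(self, entry):
        """Write a (possibly speculative) result into the working registers."""
        self.working[entry["dest"]] = entry["value"]
        entry["done"] = True

    def resolve(self, branch, predicted_correctly):
        """On a misprediction, remove every instruction that follows the branch."""
        if not predicted_correctly:
            position = next(i for i, e in enumerate(self.window) if e is branch)
            del self.window[position + 1:]
            branch["mispredicted"] = True
        branch["done"] = True

    def retire(self):
        """Top compression; results are copied to architected registers."""
        retired = 0
        while self.window and self.window[0]["done"]:
            entry = self.window.pop(0)
            if entry["is_branch"]:
                if entry["mispredicted"]:
                    # Discard speculative state: architected -> working.
                    self.working = dict(self.architected)
            else:
                self.architected[entry["dest"]] = entry["value"]
            retired += 1
        return retired


core = SpeculativeCore({"r1": 0, "r2": 0})
older = core.issue(dest="r1", value=5)
branch = core.issue(is_branch=True)
younger = core.issue(dest="r2", value=9)     # on the predicted path

core.execute(younger)
speculative_view = (core.working["r2"], core.architected["r2"])
core.execute(older)
core.resolve(branch, predicted_correctly=False)
retired_count = core.retire()
```

After the mispredicted branch retires, the working registers again match the architected ones, so the speculative write to r2 has left no trace and the window is empty, ready to be refilled from the correct path.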


Fig.  5.  A  Schematic  Diagram 

Although  the  instruction  window  does  not
improve  branch  prediction  accuracy,  it  can 
help  to  mitigate  the  performance  degradation 
due  to  incorrect  branch  predictions. 

VII.  INTERRUPT  HANDLING 

It  is  imperative  that  processors,  especially 
high  performance  processors  which  execute 
instructions  concurrently  and  possibly  out  of 
order,  handle  interrupts  and  exceptions 
properly  and  efficiently.  In  this  section,  we 
present  two  window  aided  approaches  to 
handling  interrupts. 

The central point in interrupt handling is the
definition of the "interrupt point". The
interrupt point is defined to be the instruction
α such that all instructions up to, but not
including, instruction α are completed. The
interrupted program will then resume
precisely at instruction α. An instruction
window can be used advantageously to
implement this viewpoint [16].

One-instruction  Interrupt  Point 

Consider the structure depicted in Fig. 5. The
interrupt point is identified in the window.
When it reaches the top slot, the contents of
the architected registers are stored as part of
the processor state and the window is flushed
to receive instructions for the process which
services the requested interrupt.
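
A minimal sketch of this sequence, assuming each window entry records a program counter, a destination register, a result value, and a completion flag: the function try_take_interrupt and the field names are hypothetical. The function returns None while the processor must still wait, and the saved state once the interrupt point has reached the top slot.

```python
def try_take_interrupt(window, architected, interrupt_pc):
    """Hypothetical one-instruction interrupt point.

    window: list of entries, oldest first, each a dict with keys
            'pc', 'dest', 'value', 'done'.
    Returns the saved processor state, or None if older instructions
    are still outstanding.
    """
    if all(entry["pc"] != interrupt_pc for entry in window):
        raise ValueError("interrupt point is not in the window")

    # Retire completed instructions from the top until the interrupt
    # point itself occupies the top slot.
    while window[0]["pc"] != interrupt_pc:
        if not window[0]["done"]:
            return None                      # must wait for completion
        entry = window.pop(0)
        architected[entry["dest"]] = entry["value"]

    saved = {"registers": dict(architected), "resume_pc": interrupt_pc}
    window.clear()                           # flush for the service process
    return saved


architected = {"r1": 0, "r2": 0, "r3": 0}
window = [
    {"pc": 100, "dest": "r1", "value": 1, "done": False},
    {"pc": 104, "dest": "r2", "value": 2, "done": False},  # interrupt point
    {"pc": 108, "dest": "r3", "value": 3, "done": True},   # will be discarded
]

first_attempt = try_take_interrupt(window, architected, interrupt_pc=104)
window[0]["done"] = True                     # the older instruction completes
saved = try_take_interrupt(window, architected, interrupt_pc=104)
```

In this example the saved registers reflect only the instruction at address 100; the completed result of the instruction at 108 is lost when the window is flushed.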

Note that such a processor executes multiple
instructions concurrently and possibly
out of order. Even so, when an interrupt is
requested, a one-instruction interrupt point
can be clearly implemented. Smith and
Pleszkun have proposed a specific buffer to
perform the same function [20]; here the
instruction window, in addition to multiple
and out-of-order instruction dispatching and
other services, does it without appreciable
extra cost.

Multi-instruction  Interrupt  Point 


In evaluating an interrupt handling scheme,
we have to consider three factors: latency,
component cost, and performance degradation
[21]. In implementing a one-instruction
interrupt point, the processor has to wait for
all the instructions that precede the interrupt
point to complete, which takes time.
Furthermore, all instructions that follow the
interrupt point, some of which may have
already been completed, have to be discarded.

With an instruction window available, Torng
and Day [21] have developed an alternative
approach: the instruction window is included
as a component of the processor state; the
saved contents of the window provide a
modified interrupt point. A group of
instructions in the window jointly defines an
interrupt point, where the interrupted
processing should resume.
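
The idea of treating the window as part of the processor state can be illustrated with a short sketch. It is not the mechanism of [21]: the function names, the entry fields, and the assumption that results of completed but unretired instructions are held in the window entries themselves are all hypothetical.

```python
import copy


def save_processor_state(window, architected):
    """Hypothetical: snapshot the registers together with the whole window."""
    return {"registers": dict(architected), "window": copy.deepcopy(window)}


def restore_processor_state(saved):
    """Return fresh copies of the saved registers and window contents."""
    return dict(saved["registers"]), copy.deepcopy(saved["window"])


architected = {"r1": 1, "r2": 0}
window = [
    {"pc": 100, "dest": "r2", "value": None, "done": False},  # still pending
    {"pc": 104, "dest": "r1", "value": 3, "done": True},      # done out of order
]

# The interrupt is accepted without draining the window.
saved = save_processor_state(window, architected)
window.clear()             # the window is handed to the service process
architected["r1"] = 99     # the service process is free to use the registers

registers, resumed_window = restore_processor_state(saved)
```

On resumption in this sketch, the pending instruction at address 100 and the already completed one at 104 both reappear in the window, so no completed work has to be discarded or repeated.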

VIII.  CONCLUDING  REMARKS 

With device density already high and still
increasing, it is now feasible to implement an
instruction window for high performance
processors. The windowing technique will
enable multiple and out-of-order instruction
issuance; provide indispensable support for
speculative execution; and implement precise,
responsive, flexible interrupt handling.

Furthermore,  these  can  be  achieved  without 
increasing  the  length  of  the  critical  path, 
which  in  turn  determines  a  realistic  processor 
clock  rate.